Abstract: Visual grounding aims to localize the object in an image that a natural language query refers to. Despite recent progress, accurately localizing a target among multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods show a significant performance drop when an image contains multiple distractions, indicating an insufficient understanding of fine-grained semantics and of the spatial relationships between objects. In this paper, we propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue. First, we enhance the model's understanding of fine-grained semantics by injecting semantic prior information derived from text queries into the model. This is achieved by leveraging text-to-image generation models to produce images representing the semantic attributes of the target objects described in the queries. Second, we tackle the lack of training samples with multiple distractions by introducing a relation-sensitive data augmentation method. This method generates additional training data by synthesizing images containing multiple objects of the same category, together with pseudo queries based on their spatial relationships. The proposed ResVG model significantly improves the ability to comprehend both object semantics and spatial relations, leading to enhanced performance in visual grounding tasks, particularly in scenarios with multiple-instance distractions. We conduct extensive experiments on five datasets to validate the effectiveness of our methods. Code is available at https://github.com/minghangz/ResVG.
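The abstract describes pseudo queries derived from the spatial relationships between same-category objects, but does not say how they are constructed. The following is a minimal sketch of one plausible reading, assuming boxes are ranked left to right by their horizontal centers. The `Box` class, the `pseudo_queries` function, and the query templates are hypothetical illustrations, not taken from the ResVG code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical ordinal words for the query templates (assumption, not from the paper).
ORDINALS = ["first", "second", "third", "fourth", "fifth"]


@dataclass
class Box:
    """Axis-aligned box in pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center_x(self) -> float:
        return (self.x_min + self.x_max) / 2.0


def ordinal(n: int) -> str:
    """Return an ordinal word for a 1-based position, falling back to digits."""
    return ORDINALS[n - 1] if 1 <= n <= len(ORDINALS) else f"number {n}"


def pseudo_queries(category: str, boxes: List[Box]) -> List[Tuple[str, Box]]:
    """Pair each same-category box with a query describing its left-to-right rank."""
    if len(boxes) < 2:
        return []  # a single instance has no distractors, so no relation is needed
    ordered = sorted(boxes, key=lambda b: b.center_x)
    pairs = []
    for rank, box in enumerate(ordered):
        if rank == 0:
            text = f"the leftmost {category}"
        elif rank == len(ordered) - 1:
            text = f"the rightmost {category}"
        else:
            text = f"the {category} {ordinal(rank + 1)} from the left"
        pairs.append((text, box))
    return pairs


if __name__ == "__main__":
    dogs = [Box(200, 0, 300, 100), Box(0, 0, 100, 100), Box(100, 0, 200, 100)]
    for text, box in pseudo_queries("dog", dogs):
        print(text, box)
```

Each returned pair can serve as one (query, target box) training sample for the synthesized image. Ranking on a single axis is only the simplest choice; relations such as "above" or "next to" would need additional templates.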

Reimagining 3D Visual Grounding: Instance Segmentation and Transformers for Fragmented Point Cloud Scenarios

3D Object Segmentation Using Cross-Window Point Transformer with Latent Semantic Boundary Guidance

A Unified Framework for 3D Point Cloud Visual Grounding

RefMask3D: Language-Guided Transformer for 3D Referring Segmentation

Stratified Transformer for 3D Point Cloud Segmentation

SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

RESSCAL3D++: Joint Acquisition and Semantic Segmentation of 3D Point Clouds

Data-Efficient 3D Visual Grounding via Order-Aware Referring

Exploiting Contextual Objects and Relations for 3D Visual Grounding

Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding

Revisiting 3D Visual Grounding with Context-aware Feature Aggregation

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

Geodesic-Former: a Geodesic-Guided Few-shot 3D Point Cloud Instance Segmenter

LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers

Collect-and-Distribute Transformer for 3D Point Cloud Analysis

Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space

Position-Guided Point Cloud Panoptic Segmentation Transformer

3DFusion, A real-time 3D object reconstruction pipeline based on streamed instance segmented data

3D Visual Grounding-Audio: 3D scene object detection based on audio