Abstract: This paper tackles the challenging yet important task of Visual Grounding (VG), which aims to localize the visual region in a given image that a natural language query refers to. Existing efforts on the VG task are twofold: 1) two-stage methods first extract region proposals and then rank them according to their similarity to the referring expression, which usually leads to suboptimal results because performance is bounded by the quality of the region proposals; 2) one-stage methods usually predict all the possible coordinates of the target region online by leveraging modern object detection architectures, which pay little attention to cross-modality correlations and have limited generalization ability. To better address the task, we present an effective transformer-based end-to-end visual grounding approach, which focuses on capturing the cross-modality correlations between the referring expression and visual regions in order to reason accurately about the location of the target region. Specifically, our model consists of a feature encoder, a cross-modality interactor, and a modality-agnostic decoder. The feature encoder captures the intra-modality correlations, modeling the linguistic context in the query and the spatial dependencies in the image, respectively. The cross-modality interactor, which plays a key role in our model, endows it with the capability of highlighting the localization-relevant visual and textual cues through mutual verification of vision and language. The decoder learns a consolidated token representation enriched by multi-modal contexts and then directly predicts the box coordinates. Extensive experiments on five public benchmark datasets, with quantitative and qualitative analysis, clearly demonstrate the effectiveness and rationale of our proposed method.
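The abstract does not specify how the cross-modality interactor is implemented. As a rough illustration only, the sketch below assumes that "mutual verification of vision and language" can be approximated by plain scaled dot-product cross-attention applied in both directions, with no learned projections. The function names (`cross_attend`, `mutual_interaction`) and the toy token vectors are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Replace each query vector with an attention-weighted mix of the
    other modality's vectors (scaled dot-product, no learned weights)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return attended


def mutual_interaction(text_tokens, visual_tokens):
    """Assumed form of mutual verification: text attends to vision,
    and vision attends to text."""
    text_ctx = cross_attend(text_tokens, visual_tokens)
    visual_ctx = cross_attend(visual_tokens, text_tokens)
    return text_ctx, visual_ctx


# Toy inputs: 2 text tokens and 3 visual tokens, each a 2-d vector (hypothetical).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_ctx, visual_ctx = mutual_interaction(text_tokens, visual_tokens)
```

In this toy setup, each text token ends up weighted toward the visual tokens it aligns with, and vice versa. A real model would add learned query, key, and value projections, multiple heads, and residual connections, none of which are described in the abstract.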

3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds

Text to Point Cloud Localization with Relation-Enhanced Transformer

TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Transformer-based Visual Grounding with Cross-modality Interaction

Multi-View Transformer for 3D Visual Grounding

Exploiting Contextual Objects and Relations for 3D Visual Grounding

Reimagining 3D Visual Grounding: Instance Segmentation and Transformers for Fragmented Point Cloud Scenarios

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

Multi-Attribute Interactions Matter for 3D Visual Grounding

End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning

Context Disentangling and Prototype Inheriting for Robust Visual Grounding

TransVG: End-to-End Visual Grounding with Transformers

Dense Object Grounding in 3D Scenes

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Can Transformers Capture Spatial Relations between Objects?

Mono3DVG: 3D Visual Grounding in Monocular Images

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

Data-Efficient 3D Visual Grounding via Order-Aware Referring