Abstract: This paper tackles the challenging yet important task of Visual Grounding (VG), which aims to localize the visual region in a given image that is referred to by a natural language query. Existing approaches to the VG task fall into two categories: 1) two-stage methods first extract region proposals and then rank them according to their similarity to the referring expression, which usually leads to suboptimal results because performance is bounded by the quality of the region proposals; 2) one-stage methods usually predict all the possible coordinates of the target region online by leveraging modern object detection architectures, which pay little attention to cross-modality correlations and have limited generalization ability. To better address the task, we present an effective transformer-based end-to-end visual grounding approach, which focuses on capturing the cross-modality correlations between the referring expression and visual regions in order to accurately reason about the location of the target region. Specifically, our model consists of a feature encoder, a cross-modality interactor, and a modality-agnostic decoder. The feature encoder captures the intra-modality correlations, modeling the linguistic context in the query and the spatial dependencies in the image, respectively. The cross-modality interactor, which plays a key role in our model, enables the model to highlight localization-relevant visual and textual cues through mutual verification of vision and language. The decoder learns a consolidated token representation enriched by multi-modal contexts and directly predicts the box coordinates. Extensive experiments on five public benchmark datasets, with quantitative and qualitative analysis, clearly demonstrate the effectiveness and rationale of our proposed method.
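To make the idea of "mutual verification of vision and language" more concrete, the following is a minimal sketch of bidirectional cross-attention between textual and visual tokens. It is an illustration only and not the authors' cross-modality interactor: it assumes plain scaled dot-product attention with no learned projections, no multiple heads, and no normalization layers. The helper names (`softmax`, `cross_attend`) and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Let each query vector attend over all vectors in keys_values.

    Assumption: keys and values are the same vectors and there are no
    learned projections, unlike a real transformer layer.
    Returns (context vectors, attention weights).
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(wi * v[d] for wi, v in zip(w, keys_values))
            for d in range(dim)
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: 2 word tokens and 3 visual region tokens.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
visual_tokens = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
]

# Language attends to vision, and vision attends to language.
text_ctx, text_to_vis = cross_attend(text_tokens, visual_tokens)
vis_ctx, vis_to_text = cross_attend(visual_tokens, text_tokens)
```

In this toy setup each word token places most of its attention on the visual token it aligns with, and vice versa. In a full model, such attended features would feed a decoder that regresses the box coordinates, which this sketch does not attempt.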

Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions

Grounding Referring Expressions in Images by Variational Context

Visual-Semantic Graph Matching for Visual Grounding

Editorial Paper for Pattern Recognition Letters VSI on Cross Model Understanding for Visual Question Answering

Referring Expression Grounding by Marginalizing Scene Graph Likelihood

Referencing Where to Focus: Improving Visual Grounding with Referential Query

Joint Visual Grounding with Language Scene Graphs

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Mind the Context - The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions.

Context Disentangling and Prototype Inheriting for Robust Visual Grounding

A Context-Based Network for Referring Image Segmentation

Referring Expression Comprehension Via Co-attention and Visual Context.

Learning Visual Grounding from Generative Vision and Language Model

Linguistic Structure Guided Context Modeling for Referring Image Segmentation

Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

Transformer-based Visual Grounding with Cross-modality Interaction

Words Aren't Enough, Their Order Matters: on the Robustness of Grounding Visual Referring Expressions.

Context-LGM: Leveraging Object-Context Relation for Context-Aware Object Recognition

LGVC: Language-Guided Visual Context Modeling for 3D Visual Grounding

OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling