Abstract: This paper tackles the challenging yet important task of Visual Grounding (VG), which aims to localize the visual region in a given image that a natural language query refers to. Existing efforts on the VG task fall into two groups: 1) two-stage methods first extract region proposals and then rank them by their similarity to the referring expression, which usually leads to suboptimal results because the final prediction is limited by the quality of the region proposals; 2) one-stage methods usually predict all the possible coordinates of the target region online by leveraging modern object detection architectures, but they pay little attention to cross-modality correlations and have limited generalization ability. To better address the task, we present an effective transformer-based end-to-end visual grounding approach, which focuses on capturing the cross-modality correlations between the referring expression and visual regions in order to reason accurately about the location of the target region. Specifically, our model consists of a feature encoder, a cross-modality interactor, and a modality-agnostic decoder. The feature encoder captures the intra-modality correlations, modeling the linguistic context in the query and the spatial dependencies in the image, respectively. The cross-modality interactor, which plays a key role in our model, lets vision and language verify each other so that the model can highlight the visual and textual cues relevant to localization. The decoder learns a consolidated token representation enriched by multi-modal context and then directly predicts the box coordinates. Extensive experiments on five public benchmark datasets, with quantitative and qualitative analysis, demonstrate the effectiveness and rationale of our proposed method.
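
The three components named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: it assumes plain single-head scaled dot-product attention without learned projections, approximates "mutual verification" by attention in both directions, and uses random, untrained weights. All names (`attend`, `ground`, `box_token`, `head_weights`) are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def residual(x, y):
    """Element-wise sum of two equally shaped token sequences."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def ground(text_tokens, visual_tokens, box_token, head_weights):
    # 1) Feature encoder (assumed form): self-attention within each modality.
    text = residual(text_tokens, attend(text_tokens, text_tokens, text_tokens))
    vis = residual(visual_tokens, attend(visual_tokens, visual_tokens, visual_tokens))
    # 2) Cross-modality interactor (assumed form): each modality queries the other.
    text_x = residual(text, attend(text, vis, vis))
    vis_x = residual(vis, attend(vis, text, text))
    # 3) Decoder (assumed form): one token gathers context from both modalities.
    context = text_x + vis_x
    fused = attend([box_token], context, context)[0]
    # Box head: linear layer plus sigmoid, giving four values in (0, 1),
    # read here as normalized (cx, cy, w, h).
    return [1.0 / (1.0 + math.exp(-sum(w * f for w, f in zip(row, fused))))
            for row in head_weights]


def rand_vec(dim):
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


random.seed(0)
DIM = 4
text_tokens = [rand_vec(DIM) for _ in range(3)]    # stand-in for query word features
visual_tokens = [rand_vec(DIM) for _ in range(6)]  # stand-in for image patch features
box = ground(text_tokens, visual_tokens, rand_vec(DIM),
             [rand_vec(DIM) for _ in range(4)])
print(box)
```

Because the weights are random, the printed box is meaningless; the sketch only shows how a single token can aggregate jointly attended text and image features and regress coordinates directly, with no region proposals or anchor boxes involved.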

Learning to Assemble Neural Module Tree Networks for Visual Grounding

GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

Jointly Learning Truth-Conditional Denotations and Groundings using Parallel Attention

Visual-Semantic Graph Matching for Visual Grounding

Learning to Compose and Reason with Language Tree Structures for Visual Grounding

Visualizing and Understanding Neural Models in NLP

Joint Visual Grounding with Language Scene Graphs

Interpretable Visual Question Answering by Reasoning on Dependency Trees

End-to-End Modeling Via Information Tree for One-Shot Natural Language Spatial Video Grounding

Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding

vtGraphNet: Learning weakly-supervised scene graph for complex visual grounding

Learning to Reason: End-to-End Module Networks for Visual Question Answering

End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning

Multi-Granularity Modularized Network for Abstract Visual Reasoning

From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering

Transformer-based Visual Grounding with Cross-modality Interaction

Visual Grounding With Joint Multimodal Representation and Interaction

Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding