Abstract: Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning. However, their performance significantly drops when dealing with complex textual expressions. This is because this paradigm only utilizes limited downstream data to fit the multi-modal feature fusion. Therefore, it is only effective when the textual expressions are relatively simple. In contrast, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multimodal content from a visual-linguistic context, has not been fully investigated. In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding. Specifically, we decouple visual-linguistic feature fusion from downstream tasks by leveraging existing multimodal pre-trained models and incorporating additional object tokens to facilitate deep integration of downstream and pre-training tasks. Furthermore, we design a dynamic weight-balance distillation method in the multi-branch synchronous learning process to enhance the representation capability of the simpler branch. This branch only consists of a lightweight MLP, which simplifies the structure and improves reasoning speed. Experiments on six widely used VG datasets, i.e., RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO, demonstrate the superiority of SimVG. Finally, the proposed method not only achieves improvements in efficiency and convergence speed but also attains new state-of-the-art performance on these benchmarks. Code and models will be available at https://github.com/Dmmm1997/SimVG.

Transformer-based Visual Grounding with Cross-modality Interaction

GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning

TransVG: End-to-End Visual Grounding with Transformers

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

Visual-Semantic Graph Matching for Visual Grounding

Visual Grounding With Joint Multimodal Representation and Interaction

Language Query-Based Transformer With Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation

Context Disentangling and Prototype Inheriting for Robust Visual Grounding

Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision

Visual Grounding with Attention-Driven Constraint Balancing

Multi-View Transformer for 3D Visual Grounding

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Joint Visual Grounding with Language Scene Graphs

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

Learning Visual Grounding from Generative Vision and Language Model