Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation

Jiaxi Wang, Wenhui Hu, Xueyang Liu, Beihu Wu, Yuting Qiu, YingYing Cai

2024-07-07

Abstract: Visual grounding aims to align visual information of specific regions of images with corresponding natural language expressions. Current visual grounding methods leverage pre-trained visual and language backbones independently to obtain visual features and linguistic features. Although these two types of features are then fused through elaborately designed networks, the heterogeneity of the features renders them unsuitable for multi-modal reasoning. This problem arises from the domain gap between the single-modal pre-training backbones used in current visual grounding methods, which can hardly be bridged by the traditional end-to-end training method. To alleviate this, our work proposes an Empowering Pre-trained Model for Visual Grounding (EpmVG) framework, which distills a multimodal pre-trained model to guide the visual grounding task. EpmVG relies on a novel cross-modal distillation mechanism that can effectively introduce the consistency information of images and texts from the pre-trained model, reducing the domain gap in the backbone networks, and thereby improving the performance of the model in the visual grounding task. Extensive experiments have been conducted on five conventionally used datasets, and the results demonstrate that our method achieves better performance than state-of-the-art methods.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses a problem in visual grounding: performance degradation caused by the modality gap. Current methods obtain features from separately pre-trained visual and language models, and the heterogeneity of these features makes them poorly suited to cross-modal reasoning. To address this, the paper proposes the "Empowering Pre-trained Model for Visual Grounding (EpmVG)" framework, which uses a new cross-modal distillation (CD) loss to transfer the image-text consistency information captured by a multimodal pre-trained model, reducing the domain gap in the backbone networks and improving performance on visual grounding. Specifically, EpmVG uses the visual and textual encoders of a frozen CLIP model to generate soft labels that constrain the visual branch and the language branch. According to the reported experiments, the method narrows the modality gap between images and texts, promotes cross-modal alignment between queries and the relevant regions, and outperforms existing state-of-the-art methods on five commonly used visual grounding datasets. The paper also compares single-stage and two-stage visual grounding methods and reviews related work on knowledge distillation. Its contributions are an analysis of the problems that arise in the pre-training stage, a framework that transfers image-text correlation through cross-modal distillation, and experiments demonstrating its advantages.
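The soft-label idea described above can be illustrated with a small, dependency-free sketch. It is not the paper's CD loss, whose exact form is not given here. It assumes the frozen teacher (standing in for CLIP) and the trainable student each produce per-region visual features and one text feature, that soft labels are a temperature-scaled softmax over region-text cosine similarities, and that the student is penalized with a KL divergence. All function names, the temperature value, and the toy features are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def region_text_distribution(region_feats, text_feat, temperature):
    """Distribution over regions: how strongly each one matches the text."""
    sims = [cosine(region, text_feat) for region in region_feats]
    return softmax(sims, temperature)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_distillation_loss(teacher_regions, teacher_text,
                                  student_regions, student_text,
                                  temperature=0.1):
    """Hypothetical distillation loss (an assumption, not the paper's formula).

    The teacher's region-text distribution acts as the soft label; the
    student's distribution over the same regions is pulled toward it.
    """
    soft_labels = region_text_distribution(teacher_regions, teacher_text, temperature)
    student_dist = region_text_distribution(student_regions, student_text, temperature)
    return kl_divergence(soft_labels, student_dist)


# Toy 2-D features for two image regions and one query (made-up values).
teacher_regions = [[1.0, 0.0], [0.0, 1.0]]
teacher_text = [1.0, 0.0]

# A student that agrees with the teacher, and one that ranks regions in reverse.
aligned_loss = cross_modal_distillation_loss(
    teacher_regions, teacher_text, teacher_regions, teacher_text)
misaligned_loss = cross_modal_distillation_loss(
    teacher_regions, teacher_text, [[0.0, 1.0], [1.0, 0.0]], teacher_text)
```

In this sketch the loss vanishes when the student reproduces the teacher's region-text distribution and grows as the student's ranking of regions departs from it. In a real training setup the teacher features would be computed without gradients, and a term of this kind would be added to the usual grounding objective.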

HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

Visual-Semantic Graph Matching for Visual Grounding

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Transformer-based Visual Grounding with Cross-modality Interaction

Visual Grounding With Joint Multimodal Representation and Interaction

Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding

TransVG: End-to-End Visual Grounding with Transformers

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Language Adaptive Weight Generation for Multi-task Visual Grounding

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning

Visual Grounding Strategies for Text-Only Natural Language Processing

Cross-Modal Match for Language Conditioned 3D Object Grounding

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

Distilled Dual-Encoder Model for Vision-Language Understanding

EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model

A Visual Attention Grounding Neural Model for Multimodal Machine Translation

CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding