Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation

Jiaxi Wang, Wenhui Hu, Xueyang Liu, Beihu Wu, Yuting Qiu, YingYing Cai
2024-07-07
Abstract: Visual grounding aims to align visual information of specific regions of images with corresponding natural language expressions. Current visual grounding methods leverage pre-trained visual and language backbones independently to obtain visual features and linguistic features. Although these two types of features are then fused through elaborately designed networks, the heterogeneity of the features renders them unsuitable for multi-modal reasoning. This problem arises from the domain gap between the single-modal pre-training backbones used in current visual grounding methods, which can hardly be bridged by the traditional end-to-end training method. To alleviate this, our work proposes an Empowering Pre-trained Model for Visual Grounding (EpmVG) framework, which distills a multimodal pre-trained model to guide the visual grounding task. EpmVG relies on a novel cross-modal distillation mechanism that can effectively introduce the consistency information of images and texts from the pre-trained model, reducing the domain gap in the backbone networks, and thereby improving the performance of the model in the visual grounding task. Extensive experiments have been conducted on five conventionally used datasets, and the results demonstrate that our method achieves better performance than state-of-the-art methods.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the performance degradation that the modality gap causes in visual grounding. Current methods obtain visual and linguistic features from separately pre-trained visual and language backbones, and the heterogeneity of these features makes them poorly suited to cross-modal reasoning. To address this, the paper proposes a framework called "Empowering Pre-trained Model for Visual Grounding (EpmVG)". Through a new cross-modal distillation (CD) loss, EpmVG transfers the image-text consistency information held by a multimodal pre-trained model, which reduces the domain gap in the backbone networks and improves performance on visual grounding tasks. Specifically, EpmVG uses the visual and textual encoders of a frozen CLIP model to generate soft labels that constrain the visual branch and the language branch. Experimental results show that the method reduces the modality gap between images and texts, promotes cross-modal alignment between queries and their relevant regions, and outperforms existing state-of-the-art methods on five commonly used visual grounding datasets. The paper also compares single-stage and two-stage visual grounding methods and reviews related work on knowledge distillation. Its contributions are an analysis of the problems in the pre-training stage, a framework that transfers image-text correlation through cross-modal distillation, and experiments demonstrating its advantages.
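
The soft-label idea described above can be illustrated with a minimal sketch. The summary does not give the exact form of the CD loss, so everything here is an assumption made for illustration: the loss is taken to be a KL divergence between a teacher distribution (text-to-region similarities from frozen encoders, standing in for CLIP) and a student distribution (the same similarities from the trainable branches), and the function names, the temperature value, and the toy vectors are all hypothetical. It is not the paper's implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(text_vec, region_vecs, temperature):
    """Distribution over regions from cosine similarity to the text vector."""
    text = normalize(text_vec)
    sims = [dot(text, normalize(region)) for region in region_vecs]
    return softmax(sims, temperature)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_distillation_loss(teacher_text, teacher_regions,
                                  student_text, student_regions,
                                  temperature=0.1):
    """Assumed CD loss: match the student's text-region distribution
    to soft labels produced by the frozen teacher encoders."""
    soft_labels = similarity_distribution(teacher_text, teacher_regions, temperature)
    student_probs = similarity_distribution(student_text, student_regions, temperature)
    return kl_divergence(soft_labels, student_probs)


# Toy features: one query and two candidate regions (hypothetical values).
teacher_text = [1.0, 0.0]
teacher_regions = [[1.0, 0.0], [0.0, 1.0]]   # teacher prefers region 0
student_text = [1.0, 0.0]
student_regions = [[0.0, 1.0], [1.0, 0.0]]   # student prefers region 1

aligned_loss = cross_modal_distillation_loss(
    teacher_text, teacher_regions, teacher_text, teacher_regions)
misaligned_loss = cross_modal_distillation_loss(
    teacher_text, teacher_regions, student_text, student_regions)
```

In this sketch the loss is zero when the student reproduces the teacher's image-text agreement and grows as the two distributions diverge. In a real training setup a term of this kind would presumably be added to the grounding objective while the teacher stays frozen, but how the paper weights or combines it is not stated in this summary.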