Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

Tao Yao, Shouyong Peng, Lili Wang, Ying Li, Yujuan Sun
DOI: https://doi.org/10.1007/s10489-024-05823-1
IF: 5.3
2024-10-01
Applied Intelligence
Abstract: Recent years have seen significant improvements in multi-modal learning driven by Vision-Language Pre-training (VLP) models. However, most of them rely on coarse-grained global alignment to bridge the semantic gap when generating common representations, which leaves them unable to capture the intrinsic semantic correlations in image-text retrieval and consequently degrades accuracy. Moreover, fine-tuning a VLP model for image-text retrieval is expensive because of its large number of parameters. In this paper, we propose a simple yet effective image-text retrieval method, termed Cross-Modality Interaction Reasoning for enhancing Vision-Language Pre-training (CMIR-VLP). Specifically, a Cross-Modality Interaction Reasoning (CMIR) module, designed to inject fine-grained image-text associations into semantic correlation learning, integrates patch cues into word reasoning with a multi-modal interaction encoder. In addition, we propose a cross-interaction process that associates each piece of local text semantics with local visual information for fine-grained image-text alignment. Extensive experiments demonstrate that our method gains 52 and 97.5 over state-of-the-art non-pre-training methods on two widely used datasets, and it also outperforms several mainstream fine-tuned VLP models. The code repository is available at https://github.com/PSYGIM/CMIR-VLP.
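The fine-grained alignment idea in the abstract (associating each local text unit with local visual information) can be illustrated with a generic word-to-patch attention score. The sketch below is not the authors' CMIR module: the function names, the cosine similarity, the softmax temperature, and the mean pooling are all assumptions chosen for illustration, and the toy vectors stand in for real word and patch embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(values, temperature=0.1):
    """Numerically stable softmax with a temperature (hypothetical default)."""
    peak = max(values)
    exps = [math.exp((x - peak) / temperature) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend_word_to_patches(word, patches, temperature=0.1):
    """Build a visual context vector for one word as an attention-weighted sum of patches."""
    similarities = [cosine(word, patch) for patch in patches]
    weights = softmax(similarities, temperature)
    dim = len(patches[0])
    return [sum(w * patch[i] for w, patch in zip(weights, patches)) for i in range(dim)]


def fine_grained_score(words, patches, temperature=0.1):
    """Average, over words, of the similarity between a word and its attended patch context."""
    scores = [
        cosine(word, attend_word_to_patches(word, patches, temperature))
        for word in words
    ]
    return sum(scores) / len(scores)


# Toy embeddings (assumed, 2-dimensional) for two words and two candidate images.
words = [[1.0, 0.0], [0.0, 1.0]]
matching_patches = [[1.0, 0.0], [0.0, 1.0]]
mismatched_patches = [[-1.0, 0.0], [0.0, -1.0]]

matched = fine_grained_score(words, matching_patches)
mismatched = fine_grained_score(words, mismatched_patches)
```

In this toy setup the image whose patches line up with the words receives the higher score, which is the behavior a local alignment objective aims for, in contrast to comparing a single global image vector with a single global text vector.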
computer science, artificial intelligence