AGREE: Aligning Cross-Modal Entities for Image-Text Retrieval Upon Vision-Language Pre-trained Models

Xiaodan Wang,Lei Li,Zhixu Li,Xuwu Wang,Xiangru Zhu,Chengyu Wang,Jun Huang,Yanghua Xiao
DOI: https://doi.org/10.1145/3539597.3570481
2023-01-01
Abstract:Image-text retrieval is a challenging cross-modal task that arouses much attention. While the traditional methods cannot break down the barriers between different modalities, Vision-Language Pre-trained (VLP) models greatly improve image-text retrieval performance based on massive image-text pairs. Nonetheless, the VLP-based methods are still prone to produce retrieval results that cannot be cross-modal aligned with entities. Recent efforts try to fix this problem at the pre-training stage, which is not only expensive but also unpractical due to the unavailable of full datasets. In this paper, we novelly propose a lightweight and practical approach to align cross-modal entities for image-text retrieval upon VLP models only at the fine-tuning and re-ranking stages. We employ external knowledge and tools to construct extra fine-grained image-text pairs, and then emphasize cross-modal entity alignment through contrastive learning and entity-level mask modeling in fine-tuning. Besides, two re-ranking strategies are proposed, including one specially designed for zero-shot scenarios. Extensive experiments with several VLP models on multiple Chinese and English datasets show that our approach achieves state-of-the-art results in nearly all settings.
What problem does this paper attempt to address?