Abstract: Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various visual and language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations within each modality are important for image-text retrieval yet have been largely neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans simultaneously pay attention to the entire sample and to regional elements to understand the semantic content. To this end, we propose a Token-Guided Dual Transformer (TGDT) architecture, which consists of two homogeneous branches for the image and text modalities, respectively. TGDT incorporates both coarse- and fine-grained retrieval into a unified framework and leverages the advantages of both retrieval approaches. A novel training objective called Consistent Multimodal Contrastive (CMC) loss is proposed accordingly to ensure intra- and inter-modal semantic consistency between images and texts in the common embedding space. Equipped with a two-stage inference method based on the mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performance with extremely low inference time when compared with representative recent approaches. Code is publicly available at github.com/LCFractal/TGDT.
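
The two-stage inference idea mentioned in the abstract (a cheap coarse ranking on global embeddings, followed by re-ranking a short candidate list with a mixed global and local similarity) can be illustrated with a minimal NumPy sketch. This is not the TGDT implementation: the function names, the max-over-tokens local similarity, the mixing weight `theta`, and the candidate count `k` are all assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def global_similarity(query_global, gallery_global):
    """Cosine similarity between one query vector (d,) and a gallery (n, d)."""
    return l2_normalize(gallery_global) @ l2_normalize(query_global)


def local_similarity(query_tokens, item_tokens):
    """Hypothetical fine-grained score: for each query token, take its best
    cosine match among the item's tokens, then average over query tokens."""
    sims = l2_normalize(query_tokens) @ l2_normalize(item_tokens).T
    return float(sims.max(axis=1).mean())


def two_stage_retrieve(query_global, query_tokens,
                       gallery_global, gallery_tokens, k=2, theta=0.5):
    """Stage 1: rank the whole gallery by global similarity and keep the top k.
    Stage 2: re-rank only those k candidates by a weighted mix of global and
    local similarity (theta is an assumed mixing weight, not from the paper)."""
    coarse = global_similarity(query_global, gallery_global)
    candidates = np.argsort(-coarse)[:k]
    mixed = np.array([
        theta * coarse[i]
        + (1.0 - theta) * local_similarity(query_tokens, gallery_tokens[i])
        for i in candidates
    ])
    order = np.argsort(-mixed)
    return candidates[order], mixed[order]


# Toy data: gallery item 0 is an exact copy of the query.
rng = np.random.default_rng(0)
query_global = rng.normal(size=8)
query_tokens = rng.normal(size=(4, 8))
gallery_global = np.stack([query_global,
                           rng.normal(size=8),
                           rng.normal(size=8)])
gallery_tokens = [query_tokens.copy(),
                  rng.normal(size=(5, 8)),
                  rng.normal(size=(3, 8))]

ranked, scores = two_stage_retrieve(query_global, query_tokens,
                                    gallery_global, gallery_tokens)
print(ranked, scores)
```

Because the expensive token-level comparison runs only on the `k` candidates that survive the first stage, its cost is independent of gallery size, which is the general motivation for coarse-then-fine retrieval pipelines of this kind.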

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training
