Abstract: Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various vision and language tasks. Most previous works either learn only coarse-grained representations of the whole image and text, or elaborately establish correspondences between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations within each modality, although important for image-text retrieval, have been largely neglected. As a result, such works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans attend simultaneously to the entire sample and to its regional elements to understand semantic content. To this end, we propose a Token-Guided Dual Transformer (TGDT) architecture for image-text retrieval, consisting of two homogeneous branches for the image and text modalities, respectively. TGDT incorporates both coarse- and fine-grained retrieval into a unified framework and leverages the advantages of both retrieval approaches. A novel training objective, the Consistent Multimodal Contrastive (CMC) loss, is proposed accordingly to ensure intra- and inter-modal semantic consistency between images and texts in the common embedding space. Equipped with a two-stage inference method based on mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performance with extremely low inference time compared with representative recent approaches. Code is publicly available: github.com/LCFractal/TGDT.
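The two-stage inference idea mentioned in the abstract can be illustrated with a minimal sketch: a cheap global (coarse) similarity shortlists candidates, and a mixed global and local (fine) similarity reranks only that shortlist. The abstract does not specify the similarity functions or how they are mixed, so everything below is an assumption made for illustration: the names `cosine`, `local_similarity`, and `two_stage_retrieve` are hypothetical, the local score is a max-over-tokens average, and the mixing is a simple weighted sum. It is not the TGDT implementation; the released code at github.com/LCFractal/TGDT is the authoritative reference.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def local_similarity(query_tokens, cand_tokens):
    """Assumed fine-grained score: for each query token, take its best-matching
    candidate token, then average over the query tokens."""
    best = [max(cosine(q, c) for c in cand_tokens) for q in query_tokens]
    return sum(best) / len(best)


def two_stage_retrieve(query, gallery, top_k=2, weight=0.5):
    """Stage 1: shortlist top_k candidates by global similarity only.
    Stage 2: rerank the shortlist by an assumed weighted sum of global and local scores."""
    coarse = sorted(
        ((cosine(query["global"], item["global"]), idx)
         for idx, item in enumerate(gallery)),
        reverse=True,
    )[:top_k]

    reranked = []
    for global_sim, idx in coarse:
        local_sim = local_similarity(query["local"], gallery[idx]["local"])
        mixed = weight * global_sim + (1.0 - weight) * local_sim
        reranked.append((mixed, idx))
    reranked.sort(reverse=True)
    return [idx for _, idx in reranked]


# Toy data (made up): each item has one global vector and a list of token vectors.
query = {"global": [1.0, 0.0], "local": [[1.0, 0.0], [0.0, 1.0]]}
gallery = [
    {"global": [1.0, 0.1], "local": [[1.0, 0.0], [1.0, 0.0]]},  # best global match
    {"global": [1.0, 0.3], "local": [[1.0, 0.0], [0.0, 1.0]]},  # best local match
    {"global": [0.0, 1.0], "local": [[1.0, 0.0], [0.0, 1.0]]},  # dropped in stage 1
]

ranked = two_stage_retrieve(query, gallery, top_k=2)
print(ranked)
```

In this toy setup the expensive local comparison runs only on the `top_k` shortlisted items, which is the general reason a coarse-then-fine scheme can keep inference time low while still benefiting from fine-grained matching.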

Bi-directional Training for Composed Image Retrieval via Text Prompt Learning

Target-Guided Composed Image Retrieval

Compositional Image Retrieval via Instruction-Aware Contrastive Learning

MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval

Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Pseudo-triplet Guided Few-shot Composed Image Retrieval

Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval

Vision-by-Language for Training-Free Compositional Image Retrieval

Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

ComCLIP: Training-Free Compositional Image and Text Matching

Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity

Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval

Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy