Abstract: Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various vision-and-language tasks. Most previous works either learn only coarse-grained representations of the whole image and text, or elaborately establish correspondences between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations within each modality are important for image-text retrieval yet largely neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans simultaneously pay attention to the entire sample and to regional elements to understand the semantic content. To this end, a Token-Guided Dual Transformer (TGDT) architecture, which consists of two homogeneous branches for the image and text modalities, respectively, is proposed for image-text retrieval. The TGDT incorporates both coarse- and fine-grained retrieval into a unified framework and leverages the advantages of both retrieval approaches. A novel training objective called Consistent Multimodal Contrastive (CMC) loss is proposed accordingly to ensure intra- and inter-modal semantic consistency between images and texts in the common embedding space. Equipped with a two-stage inference method based on the mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performance with extremely low inference time when compared with representative recent approaches. Code is publicly available: github.com/LCFractal/TGDT.
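
The two-stage inference mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each sample is given as a pair of one global embedding and a list of token (local) embeddings, that stage one shortlists candidates by global cosine similarity, and that stage two re-ranks the shortlist by a weighted mix of global and local similarity. The function names, the max-over-tokens local score, the shortlist size `k`, and the mixing `weight` are illustrative assumptions and are not taken from the TGDT code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def local_similarity(query_tokens, cand_tokens):
    """Assumed fine-grained score: match every query token to its most
    similar candidate token, then average over the query tokens."""
    best = [max(cosine(q, c) for c in cand_tokens) for q in query_tokens]
    return sum(best) / len(best)


def two_stage_retrieve(query, candidates, k=2, weight=0.5):
    """Return candidate indices, best first.

    query      -- (global_vec, token_vecs)
    candidates -- list of (global_vec, token_vecs)
    k          -- hypothetical shortlist size kept after stage one
    weight     -- hypothetical mixing weight of the global score
    """
    q_global, q_tokens = query

    # Stage 1: cheap coarse ranking of all candidates by global similarity.
    global_scores = [cosine(q_global, g) for g, _ in candidates]
    shortlist = sorted(range(len(candidates)),
                       key=lambda i: global_scores[i], reverse=True)[:k]

    # Stage 2: costlier local similarity is computed only for the shortlist,
    # and the shortlist is re-ranked by the mixed global/local score.
    mixed = {
        i: weight * global_scores[i]
           + (1.0 - weight) * local_similarity(q_tokens, candidates[i][1])
        for i in shortlist
    }
    return sorted(shortlist, key=lambda i: mixed[i], reverse=True)
```

Because the local score is evaluated only for the `k` shortlisted candidates, the cost of the fine-grained comparison does not grow with the size of the gallery, which is one plausible reading of how a mixed global and local scheme can keep inference time low.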

Cross-modal Prominent Fragments Enhancement Aligning Network for Image-text Retrieval

A New Fine-grained Alignment Method for Image-text Matching

Semantic enhancement and multi-level alignment network for cross-modal retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Cross-modal alignment with graph reasoning for image-text retrieval

Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

A Mutually Textual and Visual Refinement Network for Image-Text Matching

Global-aware Fragment Representation Aggregation Network for image–text retrieval

HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

Cross-modal Graph Matching Network for Image-text Retrieval

Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching

FB-Net: Dual-Branch Foreground-Background Fusion Network With Multi-Scale Semantic Scanning for Image-Text Retrieval

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Dual-path Rare Content Enhancement Network for Image and Text Matching

Context-Aware Attention Network for Image-Text Retrieval

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment

CMPD: Using Cross Memory Network With Pair Discrimination for Image-Text Retrieval