Abstract: Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pretrained encoders, which still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose an image-text alignment module, SelfAlign, on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that force image-text alignment at both the concept level and the context level by self-supervised contrastive learning. It does not require cross-modal embedding interactions during training, and it maintains independent image and text encoders during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pretraining independent-embedding models by 9.1%, 4.2% and 6.6% in terms of R@sum score on the Flickr30K, MSCOCO 1K and MSCOCO 5K datasets, respectively. The retrieval accuracy also outperforms most existing interactive-embedding models with an orders-of-magnitude decrease in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign
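
To make the reported metric concrete, the following is a minimal sketch of how an R@sum score is commonly computed for an independent-embedding model. The abstract does not define R@sum, so the definition used here (the sum of R@1, R@5 and R@10 over both image-to-text and text-to-image retrieval) is an assumption based on common usage. The sketch also assumes one caption per image, with the matching pair sharing the same index. The function names `l2_normalize`, `recall_at_k` and `r_at_sum` are hypothetical and do not come from the SelfAlign code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K in percent, assuming candidate i is the ground truth for query i.

    sim[i, j] is the similarity between query i and candidate j.
    """
    ground_truth = np.diag(sim)[:, None]
    # Rank of the ground truth = number of candidates that score strictly higher.
    ranks = (sim > ground_truth).sum(axis=1)
    return {k: 100.0 * np.mean(ranks < k) for k in ks}


def r_at_sum(img_emb, txt_emb):
    """Assumed R@sum: R@1 + R@5 + R@10 summed over both retrieval directions."""
    # Independent embedding: the two modalities only meet in this matrix product,
    # so candidate embeddings can be computed once and cached offline.
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
    image_to_text = recall_at_k(sim)
    text_to_image = recall_at_k(sim.T)
    return sum(image_to_text.values()) + sum(text_to_image.values())
```

Under this definition the maximum score is 600. Real benchmarks such as Flickr30K pair each image with several captions, which this sketch does not handle.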

What problem does this paper attempt to address?

This paper addresses how to improve retrieval accuracy in the image-text retrieval task while maintaining high efficiency. Existing methods tend to favor one side of this trade-off. Independent-embedding models focus on retrieval efficiency: they achieve fast retrieval but cannot provide fine-grained content alignment, which results in low retrieval precision. Interactive-embedding models focus on retrieval accuracy: they achieve fine-grained image-text matching through cross-modal attention mechanisms, but their high computational complexity makes them unsuitable for large-scale online retrieval scenarios.

The paper proposes a new method, SelfAlign, which aims to learn fine-grained image-text alignment through self-supervised contrastive learning while maintaining the efficiency of the independent-embedding model. SelfAlign contains two sub-modules:

1. **Local Concept Alignment (LCA)**: This sub-module is injected at the object and word encoding layers and enforces consistency between visual and textual concept embeddings. It does so by discovering pseudo word-object correspondences and applying a clustering-based fine-grained alignment strategy.
2. **Contextual Relation Alignment (CRA)**: This sub-module is injected at the context encoding layer and captures semantic correspondences at the context level. It first performs shared context enhancement and then context-level alignment.

Through these two sub-modules, SelfAlign can significantly improve the retrieval precision of the independent-embedding model without sacrificing efficiency. The experimental results show that SelfAlign improves the R@sum scores by 9.1%, 4.2% and 6.6% on Flickr30K, MSCOCO 1K and MSCOCO 5K, respectively, while its retrieval time is much lower than that of most existing interactive-embedding models.
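
As an illustration of the kind of contrastive objective mentioned above, here is a minimal sketch of a generic symmetric InfoNCE loss over a batch of paired image and text embeddings. It is not the LCA or CRA loss from the paper, whose exact formulations are not given in this summary. The function names and the `temperature` value of 0.07 are assumptions chosen for illustration.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(img_emb, txt_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss; row i of each input is a matching pair.

    The temperature is an illustrative assumption, not a value from the paper.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Cosine similarities between every image and every text in the batch.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its probability mass on the diagonal.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: the same requirement applied to each column.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss is low when each image embedding is closest to its own text embedding and high when pairs are mismatched. Because the two modalities interact only through the similarity matrix, the encoders stay independent, which is the property that keeps retrieval fast.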

Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment

A New Fine-grained Alignment Method for Image-text Matching

HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Align Yourself: Self-supervised Pre-training for Fine-grained Recognition via Saliency Alignment

Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision

Learning Relation Alignment for Calibrated Cross-modal Retrieval

Coarse-to-fine Alignment Makes Better Speech-image Retrieval

Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Cross-modal alignment with graph reasoning for image-text retrieval

Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data

Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps

Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

Similarity Reasoning and Filtration for Image-Text Matching

Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval