Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding

Ruigeng Zeng, Wentao Ma, Xiaoqian Wu, Wei Liu, Jie Liu
DOI: https://doi.org/10.3390/electronics13020300
IF: 2.9
2024-01-10
Electronics
Abstract: Image–text cross-modal retrieval aims to bridge the semantic gap between different modalities, allowing for the search of images based on textual descriptions or vice versa. Existing efforts in this field concentrate on coarse-grained feature representation and then utilize pairwise ranking loss to pull image–text positive pairs closer, pushing negative ones apart. However, using pairwise ranking loss directly on coarse-grained representation lacks reliability as it disregards fine-grained information, posing a challenge in narrowing the semantic gap between image and text. To this end, we propose an Instance Contrastive Embedding (IConE) method for image–text cross-modal retrieval. Specifically, we first transfer the multi-modal pre-training model to the cross-modal retrieval task to leverage the interactive information between image and text, thereby enhancing the model's representational capabilities. Then, to comprehensively consider the feature distribution of intra- and inter-modality, we design a novel two-stage training strategy that combines instance loss and contrastive loss, dedicated to extracting fine-grained representation within instances and bridging the semantic gap between modalities. Extensive experiments on two public benchmark datasets, Flickr30k and MS-COCO, demonstrate that our IConE outperforms several state-of-the-art (SoTA) baseline methods and achieves competitive performance.
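The abstract contrasts a pairwise ranking loss with a contrastive loss over image–text pairs. The sketch below illustrates the two ideas in generic form, using only the Python standard library. It is not the paper's implementation: the function names, the hinge margin of 0.2, the temperature of 0.07, and the use of in-batch negatives are all assumptions chosen for illustration, and the paper's instance loss is not reproduced here.

```python
import math


def cosine_similarity_matrix(image_embs, text_embs):
    """Return S where S[i][j] is the cosine similarity of image i and text j."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
        return [x / norm for x in vec]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def hinge_ranking_loss(sim, margin=0.2):
    """Generic pairwise ranking loss over all in-batch negatives (margin is an assumed value)."""
    n = len(sim)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            loss += max(0.0, margin - sim[i][i] + sim[i][j])  # image i against wrong text j
            loss += max(0.0, margin - sim[i][i] + sim[j][i])  # text i against wrong image j
    return loss / n


def _cross_entropy_on_diagonal(rows, temperature):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(rows)


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Generic symmetric contrastive loss: image-to-text plus text-to-image, averaged."""
    columns = [list(col) for col in zip(*sim)]  # transpose for the text-to-image direction
    return 0.5 * (
        _cross_entropy_on_diagonal(sim, temperature)
        + _cross_entropy_on_diagonal(columns, temperature)
    )
```

Both functions take the similarity matrix produced by `cosine_similarity_matrix`, where matching image–text pairs sit on the diagonal. A batch whose pairs are correctly aligned should give a lower loss under either function than a batch whose captions have been shuffled.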
Engineering, Electrical & Electronic; Computer Science, Information Systems; Physics, Applied
What problem does this paper attempt to address?
The paper introduces a new method called Instance Contrastive Embedding (IConE) for image-text cross-modal retrieval. The goal is to bridge the semantic gap between different modalities (images and text), enabling the search of images based on textual descriptions or vice versa.

### Problem Statement

The paper identifies two main challenges in existing approaches:

1. **Insufficient Semantic Interaction Between Image and Text**: Most existing methods use a dual-tower structure, where separate encoders (like CNN and BERT) extract features from images and text independently. This structure lacks interaction between modalities, leading to a loss of inter-modality correlation information.
2. **Overlooking the Feature Representation Distribution of Intra-Modality**: Existing methods primarily focus on inter-modality distances using pairwise ranking loss, but they do not explicitly consider the distribution of intra-modality feature representation. This can cause issues in distinguishing subtle differences between semantically similar instances.

### Proposed Solution: IConE Method

To address these challenges, the authors propose the IConE method, which includes the following key components:

1. **Multi-Modal Pre-Training Knowledge Transfer**: The method leverages knowledge from multi-modal pre-training models (such as CLIP) and transfers it to the cross-modal retrieval task. This enhances feature representation and compensates for the lack of inter-modality interaction in t