Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding

Ruigeng Zeng, Wentao Ma, Xiaoqian Wu, Wei Liu, Jie Liu

DOI: https://doi.org/10.3390/electronics13020300

IF: 2.9

2024-01-10

Electronics

Abstract: Image–text cross-modal retrieval aims to bridge the semantic gap between different modalities, allowing for the search of images based on textual descriptions or vice versa. Existing efforts in this field concentrate on coarse-grained feature representation and then utilize pairwise ranking loss to pull image–text positive pairs closer, pushing negative ones apart. However, using pairwise ranking loss directly on coarse-grained representation lacks reliability as it disregards fine-grained information, posing a challenge in narrowing the semantic gap between image and text. To this end, we propose an Instance Contrastive Embedding (IConE) method for image–text cross-modal retrieval. Specifically, we first transfer the multi-modal pre-training model to the cross-modal retrieval task to leverage the interactive information between image and text, thereby enhancing the model's representational capabilities. Then, to comprehensively consider the feature distribution of intra- and inter-modality, we design a novel two-stage training strategy that combines instance loss and contrastive loss, dedicated to extracting fine-grained representation within instances and bridging the semantic gap between modalities. Extensive experiments on two public benchmark datasets, Flickr30k and MS-COCO, demonstrate that our IConE outperforms several state-of-the-art (SoTA) baseline methods and achieves competitive performance.
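The abstract names the two losses but does not give their formulas. As a rough illustration only, the sketch below uses common textbook forms: a symmetric softmax (InfoNCE-style) contrastive loss over a batch of paired embeddings, and an instance loss that treats each image–text pair as its own class under a shared linear classifier. The function names, the `temperature` value, and the `class_weights` parameter are assumptions made for this example and are not taken from the paper. How the two losses are scheduled across the two training stages is not stated in the abstract, so it is not modelled here.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    # Hypothetical helper: symmetric image-to-text and text-to-image loss.
    # Pair k (img_embs[k], txt_embs[k]) is the positive; all others are negatives.
    imgs = [l2_normalize(v) for v in img_embs]
    txts = [l2_normalize(v) for v in txt_embs]
    n = len(imgs)
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    t2i = sum(cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def instance_loss(img_embs, txt_embs, class_weights, instance_ids):
    # Hypothetical helper: each training instance is its own class, and the
    # same classifier weights score both the image and the text embedding.
    total = 0.0
    for emb_set in (img_embs, txt_embs):
        for emb, label in zip(emb_set, instance_ids):
            logits = [dot(w, emb) for w in class_weights]
            total += cross_entropy(logits, label)
    return total / (2 * len(instance_ids))

# Toy data: two instances with 2-D embeddings (illustrative values only).
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
weights = [[5.0, 0.0], [0.0, 5.0]]
ids = [0, 1]

print(contrastive_loss(images, texts_aligned), contrastive_loss(images, texts_swapped))
print(instance_loss(images, texts_aligned, weights, ids),
      instance_loss(images, texts_swapped, weights, ids))
```

In this toy setup both losses come out lower when each text embedding matches its paired image than when the texts are swapped. The contrastive term compares embeddings across modalities, while the instance term ties both modalities to a per-instance class.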

Engineering, Electrical & Electronic; Computer Science, Information Systems; Physics, Applied

What problem does this paper attempt to address?

The paper introduces a new method called Instance Contrastive Embedding (IConE) for image-text cross-modal retrieval. The goal is to bridge the semantic gap between different modalities (images and text), enabling the search of images based on textual descriptions or vice versa.

### Problem Statement

The paper identifies two main challenges in existing approaches:

1. **Insufficient Semantic Interaction Between Image and Text**: Most existing methods use a dual-tower structure, where separate encoders (like CNN and BERT) extract features from images and text independently. This structure lacks interaction between modalities, leading to a loss of inter-modality correlation information.
2. **Overlooking the Feature Representation Distribution of Intra-Modality**: Existing methods primarily focus on inter-modality distances using pairwise ranking loss, but they do not explicitly consider the distribution of intra-modality feature representation. This can cause issues in distinguishing subtle differences between semantically similar instances.

### Proposed Solution: IConE Method

To address these challenges, the authors propose the IConE method, which includes the following key components:

1. **Multi-Modal Pre-Training Knowledge Transfer**: The method leverages knowledge from multi-modal pre-training models (such as CLIP) and transfers it to the cross-modal retrieval task. This enhances feature representation and compensates for the lack of inter-modality interaction in t

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation.

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Cross-modal Image Retrieval with Deep Mutual Information Maximization

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Cross-Modal Image-Text Retrieval with Semantic Consistency

Dual-path Convolutional Image-Text Embeddings with Instance Loss

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

Regularizing Visual Semantic Embedding with Contrastive Learning for Image-Text Matching

Semantic-enhanced discriminative embedding learning for cross-modal retrieval

AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion

Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval

Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning

Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Cross-modal Semantic Interference Suppression for image-text matching