Abstract: We tackle the cross-modal retrieval problem, where learning is supervised only by relevant multi-modal pairs in the data. Although contrastive learning is the most popular approach for this task, it makes the potentially wrong assumption that instances in different pairs are automatically irrelevant. To address this issue, we propose a novel loss function based on self-labeling of the unknown semantic classes. Specifically, we predict class labels for the data instances in each modality and assign those labels to the corresponding instances in the other modality (i.e., we swap the pseudo-labels). With these swapped labels, we learn the data embedding for each modality using the supervised cross-entropy loss. This way, cross-modal instances from different pairs that are semantically related can be aligned with each other by the class predictor. We tested our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval. On all of these tasks, our method achieves significant performance improvements over contrastive learning.
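To make the criticized assumption concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss, written with the Python standard library only. It is not taken from the paper: the function name `info_nce_per_query`, the temperature value, and the toy embeddings are all assumptions chosen for illustration. In the toy data, pairs 0 and 1 are deliberately given identical embeddings to stand in for two pairs that share the same semantics.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce_per_query(queries, items, temperature=0.1):
    """Per-query contrastive loss: item i is the only positive for query i,
    and every item from a different pair is treated as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, x) / temperature for x in items]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return losses

# Hypothetical toy embeddings: pairs 0 and 1 are semantically identical,
# pair 2 is unrelated to both.
text = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

losses = info_nce_per_query(text, image)
print([round(l, 4) for l in losses])
```

Even though every query is perfectly aligned with its own partner, queries 0 and 1 each keep a loss of about log 2, because the equally relevant item from the other pair is counted as a negative. Query 2, which has no related item elsewhere in the batch, has a loss close to zero.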

What problem does this paper attempt to address?

This paper addresses an important problem in cross-modal retrieval: existing contrastive learning methods assume that instances from different pairs are automatically unrelated, an assumption that can be wrong. Specifically:

1. **Cross-modal retrieval problem**:
   - The goal of cross-modal retrieval is, given a query in one modality (e.g., an image), to retrieve the most relevant items from a database in another modality (e.g., text).
   - Traditional methods mainly rely on contrastive learning, that is, learning a cross-modal similarity metric by pulling relevant pairs closer together and pushing irrelevant pairs apart.
2. **Limitations of existing methods**:
   - Contrastive learning methods implicitly assume that instances from different pairs are unrelated, but this assumption may be wrong. Training data usually contains only relevant pairs, and the relation between instances in different pairs is not annotated.
   - This assumption may degrade model performance in practice, because instances from different pairs can in fact be semantically related.
3. **The new method (SwAMP) proposed in the paper**:
   - To address these problems, the paper proposes a new loss function, SwAMP (Swapped Assignment of Multi-Modal Pairs), which is based on self-labeling of the unknown semantic classes.
   - Specifically, it predicts the class labels of data instances in each modality and assigns these labels to the corresponding instances in the other modality (i.e., it swaps the pseudo-labels). A supervised cross-entropy loss is then used to learn the data embeddings of each modality.
   - In this way, cross-modal instances that come from different pairs but are semantically related can be aligned by the class predictor.
4. **Experimental verification**:
   - The method was tested on several real-world cross-modal retrieval tasks, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval.
   - The experimental results show that SwAMP significantly outperforms traditional contrastive learning on these tasks.

In summary, this paper targets the wrong assumption that contrastive learning makes about the relation between instances from different pairs in cross-modal retrieval, and improves model performance by introducing self-labeled semantic class labels.
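The swapped pseudo-label idea summarized above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the class predictor is assumed to be a dot product against a set of class prototype vectors, the pseudo-labels are assumed to be plain softmax predictions, and the names `class_probs` and `swapped_loss` as well as the toy data are hypothetical. The summary does not say how the pseudo-labels are actually produced, and a practical version would also need some mechanism to keep all instances from collapsing into a single class.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def class_probs(embedding, prototypes, temperature=0.1):
    """Assumed class predictor: softmax over prototype similarities."""
    logits = [sum(a * b for a, b in zip(embedding, p)) / temperature
              for p in prototypes]
    return softmax(logits)

def cross_entropy(target, probs, eps=1e-12):
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))

def swapped_loss(emb_a, emb_b, prototypes):
    """Average cross-entropy where each modality is supervised by the
    pseudo-label predicted from its partner in the other modality."""
    total = 0.0
    for a, b in zip(emb_a, emb_b):
        p_a = class_probs(a, prototypes)
        p_b = class_probs(b, prototypes)
        # Swap: the label from modality A is the target for B, and vice versa.
        # In real training the targets would be treated as constants.
        total += cross_entropy(p_a, p_b) + cross_entropy(p_b, p_a)
    return total / (2 * len(emb_a))

# Hypothetical toy data: two classes, three pairs; pairs 0 and 1 share a class.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
emb_text = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
emb_image_aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
emb_image_misaligned = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

aligned = swapped_loss(emb_text, emb_image_aligned, prototypes)
misaligned = swapped_loss(emb_text, emb_image_misaligned, prototypes)
print(aligned < misaligned)
```

Note that the loss contains no explicit negatives: pairs 0 and 1 both map to the same class and are not pushed apart, which is the behavior the summary attributes to aligning related instances through the class predictor.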

SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval

Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities

Cross-Modal Retrieval with Partially Mismatched Pairs

Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval

Adversarial Cross-Modal Retrieval

Cross-Modal Coordination Across a Diverse Set of Input Modalities

Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval

Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval

Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment

CMPD: Using Cross Memory Network With Pair Discrimination for Image-Text Retrieval

Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective

Deep Supervised Cross-Modal Retrieval

Weakly-paired Deep Dictionary Learning for Cross-Modal Retrieval

Cross-modal Deep Metric Learning with Multi-Task Regularization

A semi-supervised cross-modal memory bank for cross-modal retrieval

Adaptive Marginalized Semantic Hashing for Unpaired Cross-Modal Retrieval

Learning Discriminative Representations for Semantic Cross Media Retrieval

Cross-Modal Contrastive Learning for Domain Adaptation in 3D Semantic Segmentation.

Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval