SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval

Minyoung Kim
DOI: https://doi.org/10.48550/arXiv.2111.05814
2022-10-12
Abstract: We tackle the cross-modal retrieval problem, where learning is only supervised by relevant multi-modal pairs in the data. Although the contrastive learning is the most popular approach for this task, it makes potentially wrong assumption that the instances in different pairs are automatically irrelevant. To address the issue, we propose a novel loss function that is based on self-labeling of the unknown semantic classes. Specifically, we aim to predict class labels of the data instances in each modality, and assign those labels to the corresponding instances in the other modality (i.e., swapping the pseudo labels). With these swapped labels, we learn the data embedding for each modality using the supervised cross-entropy loss. This way, cross-modal instances from different pairs that are semantically related can be aligned to each other by the class predictor. We tested our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval. For all these tasks our method achieves significant performance improvement over the contrastive learning.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a weakness of contrastive learning in cross-modal retrieval: it assumes that instances from different pairs are automatically irrelevant to each other, an assumption that can be wrong. Specifically:

1. **Cross-modal retrieval problem**:
   - The goal of cross-modal retrieval is, given a query in one modality (e.g., an image), to retrieve the most relevant items from a database in another modality (e.g., text).
   - Existing methods mainly rely on contrastive learning, which learns a cross-modal similarity metric by pulling relevant pairs closer together and pushing irrelevant pairs apart.
2. **Limitations of existing methods**:
   - Contrastive learning implicitly treats instances from different pairs as irrelevant. Training data usually contain only relevant pairs, and whether instances from different pairs are related is never checked.
   - Because instances from different pairs can in fact be semantically related, this assumption can hurt model performance in practice.
3. **The new method (SwAMP) proposed in the paper**:
   - To address this issue, the paper proposes a new loss function, SwAMP (Swapped Assignment of Multi-Modal Pairs), based on self-labeling of the unknown semantic classes.
   - Specifically, it predicts the class labels of data instances in each modality and assigns these labels to the corresponding instances in the other modality (i.e., it swaps the pseudo-labels). A supervised cross-entropy loss on the swapped labels is then used to learn the data embedding for each modality.
   - In this way, cross-modal instances that come from different pairs but are semantically related can be aligned with each other by the class predictor.
4. **Experimental verification**:
   - The method was tested on several real-world cross-modal retrieval tasks, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval.
   - According to the reported results, SwAMP achieves significant performance improvements over contrastive learning on all of these tasks.

In summary, the paper targets the incorrect assumption in contrastive learning that instances from different pairs are unrelated, and it improves retrieval performance by introducing self-labeled semantic class labels that are swapped across modalities.
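The swapped pseudo-label idea described above can be illustrated with a small sketch. This is not the paper's implementation: the summary does not say how the pseudo-labels are computed, so the sketch assumes each modality predicts a softmax distribution over a set of hypothetical class prototypes and uses the other modality's prediction as a fixed soft target. The names `class_probs`, `swapped_loss`, `prototypes`, and `temperature` are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def class_probs(embedding, prototypes, temperature=0.1):
    """Predicted distribution over K latent classes (assumed prototype classifier)."""
    return softmax([dot(embedding, p) / temperature for p in prototypes])

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of `predicted` against a (soft) `target` distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def swapped_loss(emb_a, emb_b, prototypes):
    """Average swapped cross-entropy; emb_a[i] and emb_b[i] form the i-th pair."""
    total = 0.0
    for xa, xb in zip(emb_a, emb_b):
        pa = class_probs(xa, prototypes)
        pb = class_probs(xb, prototypes)
        # Swap: modality B's prediction is the pseudo-label for modality A,
        # and vice versa. In real training the targets would be held fixed
        # (no gradient flows through them).
        total += cross_entropy(pb, pa) + cross_entropy(pa, pb)
    return total / (2 * len(emb_a))

# Toy data: two latent classes, two pairs of 2-D embeddings.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b_aligned = [[0.9, 0.1], [0.1, 0.9]]     # each pair agrees on its class
emb_b_misaligned = [[0.0, 1.0], [1.0, 0.0]]  # each pair disagrees

loss_aligned = swapped_loss(emb_a, emb_b_aligned, prototypes)
loss_misaligned = swapped_loss(emb_a, emb_b_misaligned, prototypes)
print(loss_aligned < loss_misaligned)
```

The loss is low when both members of a pair are assigned to the same class and high when they are not. Because the classes are shared across all pairs, instances from different pairs that land in the same class are pulled toward the same prototype instead of being pushed apart as negatives.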