Cross-modal retrieval based on shared proxies

Yuxin Wei, Ligang Zheng, Guoping Qiu, Guocan Cai
DOI: https://doi.org/10.1007/s13735-023-00316-2
2024-01-21
International Journal of Multimedia Information Retrieval
Abstract: Learning a common space that is simultaneously semantically discriminative and modality invariant is the primary challenge in cross-modal retrieval. Conventional approaches usually rely on pairwise or triplet data relationships to learn the common space. These relationships capture data similarity only locally and cannot effectively characterize the global geometry of the common embedding space, which limits cross-modal retrieval performance. This paper proposes to integrate the principles of the shared proxy and neighborhood component analysis in order to learn a shared space for different modalities. The objective of this shared space is to minimize the distance between a sample's representation and its corresponding proxy, while also maximizing the distances between a sample's representation and the proxies that are not associated with the sample. Our proposed framework, named Cross-mOdal proXy learnIng (COXI), incorporates a cross-modal shared proxy loss, a discriminative loss, and a modality invariant loss to facilitate supervised cross-modal retrieval. Extensive experiments on benchmark datasets clearly show that COXI outperforms state-of-the-art cross-modal retrieval techniques. Code is available at https://github.com/LigangZheng/COXI.
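The proxy objective described in the abstract (pull each sample toward its own class proxy, push it away from the other proxies, and share one proxy set across modalities) can be illustrated with a minimal NumPy sketch. This is not the COXI implementation from the repository. It assumes a softmax-style variant of a Proxy-NCA loss over squared Euclidean distances on L2-normalized vectors, and the names `shared_proxy_nca_loss`, `image_emb`, and `text_emb` are hypothetical. The discriminative and modality invariant losses are not covered.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def shared_proxy_nca_loss(embeddings, labels, proxies):
    """Illustrative proxy loss (an assumption, not the paper's exact formula).

    embeddings: (N, d) sample representations from one modality
    labels:     (N,)   integer class ids
    proxies:    (C, d) one proxy per class, shared by all modalities
    """
    x = l2_normalize(embeddings)
    p = l2_normalize(proxies)
    # Squared Euclidean distance between every sample and every proxy: (N, C).
    dist = ((x[:, None, :] - p[None, :, :]) ** 2).sum(axis=2)
    # Softmax over negative distances; subtract the row max for stability.
    logits = -dist
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-probability of each sample's own proxy, averaged.
    return -log_prob[np.arange(len(labels)), labels].mean()

# Toy data: three classes with orthogonal proxies.
proxies = np.eye(3)
labels = np.array([0, 1, 2, 1])

# Embeddings that sit exactly on their own proxy (well aligned).
image_emb = proxies[labels]
# Embeddings that sit on a wrong proxy (badly aligned).
text_emb = proxies[(labels + 1) % 3]

aligned_loss = shared_proxy_nca_loss(image_emb, labels, proxies)
misaligned_loss = shared_proxy_nca_loss(text_emb, labels, proxies)

# Both modalities are scored against the same proxies, so the terms can be summed.
total_loss = aligned_loss + misaligned_loss
```

Because both modalities are compared against the same proxy matrix, each sample is related to every class at once rather than to a handful of sampled pairs or triplets, which is the global-geometry argument the abstract makes.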
computer science, artificial intelligence, software engineering