OTCMR: Bridging Heterogeneity Gap with Optimal Transport for Cross-modal Retrieval

Mingyang Li, Shao-Lun Huang, Lin Zhang
DOI: https://doi.org/10.1145/3459637.3482158
2021-01-01
Abstract: Cross-modal retrieval is a classic task in the multimedia community, which aims to retrieve semantically similar results across different modalities. The core of cross-modal retrieval is learning highly correlated features for multi-modal data in a common feature space, so that similarity can be measured directly. In this paper, we propose a novel model that uses optimal transport to bridge the heterogeneity gap in cross-modal retrieval. Specifically, we compute the optimal transport plans between the feature distributions of different modalities and then minimize the transport cost by optimizing the feature embedding functions. In this way, the feature distributions of multi-modal data become well aligned in the common feature space. In addition, our model combines complementary losses at three levels to improve retrieval performance: 1) the semantic level, 2) the distributional level, and 3) the pairwise level. In extensive experiments, our method outperforms many other cross-modal retrieval methods, which demonstrates the efficacy of optimal transport for cross-modal retrieval.
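The distribution-level idea in the abstract (compute a transport plan between two sets of features, then read off the transport cost) can be illustrated with a minimal sketch. Everything concrete here is an assumption, not taken from the paper: the squared Euclidean ground cost, uniform weights on the samples, the entropy-regularized Sinkhorn solver, and the names `pairwise_sq_dists`, `sinkhorn_plan`, and `transport_cost`. The abstract does not say which solver or cost the authors use.

```python
import numpy as np


def pairwise_sq_dists(x, y):
    """Squared Euclidean distance between every row of x and every row of y."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Approximate transport plan between two uniform distributions.

    Assumption: entropy-regularized OT solved with Sinkhorn iterations;
    `reg` is the regularization strength (hypothetical default).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on the first set of samples
    b = np.full(m, 1.0 / m)  # uniform mass on the second set of samples
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)    # rescale rows toward marginal a
        v = b / (kernel.T @ u)  # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


def transport_cost(feats_a, feats_b, reg=0.1):
    """Transport cost between two feature sets, plus the plan that attains it."""
    cost = pairwise_sq_dists(feats_a, feats_b)
    # Rescale the cost before exponentiating to avoid numerical underflow.
    scale = max(float(cost.max()), 1e-12)
    plan = sinkhorn_plan(cost / scale, reg=reg)
    return float(np.sum(plan * cost)), plan


rng = np.random.default_rng(0)
image_feats = rng.normal(size=(8, 4))  # stand-in for embedded images
text_near = image_feats + 0.01 * rng.normal(size=(8, 4))  # well-aligned text features
text_far = image_feats + 5.0  # text features shifted away from the images

cost_near, _ = transport_cost(image_feats, text_near)
cost_far, _ = transport_cost(image_feats, text_far)
print(cost_near < cost_far)
```

The sketch only evaluates the transport cost for fixed features. Minimizing that cost with respect to the embedding functions, as the abstract describes, would additionally require a differentiable implementation inside a training loop, which is not shown here.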