Dual Subspaces with Adversarial Learning for Cross-Modal Retrieval.

Yaxian Xia, Wenmin Wang, Liang Han
DOI: https://doi.org/10.1007/978-3-030-00776-8_60
2018-01-01
Abstract: Learning an effective subspace (such as an image, text, or latent subspace) in which to calculate the correlation of items from different modalities is the core of the cross-modal retrieval task. However, data in different modalities have imbalanced and complementary relationships. Images contain abundant spatial information, while text includes more background and context details. In this paper, we propose a model with dual parallel subspaces (a visual subspace and a textual subspace) to better preserve modality-specific information. Triplet constraints are employed to minimize the semantic gap between items from different modalities that share the same concept, while maximizing that of image-text pairs with different concepts, in the corresponding subspace. We then combine adversarial learning with the dual subspaces in a novel way, as an interplay between two agents. The first agent, the dual subspaces with similarity merging and concept prediction, aims to narrow the difference between the data distributions of different modalities under the premise of concept invariance, in order to fool the other agent, a modality discriminator, which tries to distinguish images from text accurately. Extensive experiments on the Wikipedia and NUS-WIDE-10k datasets verify the effectiveness of the proposed model for cross-modal retrieval tasks, where it outperforms state-of-the-art methods.
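
The triplet constraint mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss form, so the hinge formulation, the cosine similarity measure, the margin value of 0.2, and all function and variable names below are assumptions made for illustration only, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    The loss is zero once the anchor is more similar to the positive
    (same concept, other modality) than to the negative (different
    concept) by at least `margin`.
    """
    pos_sim = cosine_similarity(anchor, positive)
    neg_sim = cosine_similarity(anchor, negative)
    return max(0.0, margin - pos_sim + neg_sim)


# Toy embeddings in a shared 2-D subspace (hypothetical values).
image_anchor = [1.0, 0.0]
text_same_concept = [1.0, 0.0]
text_other_concept = [0.0, 1.0]

# Constraint satisfied: the matching text is closer than the mismatched one.
loss_satisfied = triplet_loss(image_anchor, text_same_concept, text_other_concept)

# Constraint violated: the roles of positive and negative are swapped.
loss_violated = triplet_loss(image_anchor, text_other_concept, text_same_concept)
```

In a dual-subspace setting such as the one described, a loss of this kind would presumably be computed once in the visual subspace and once in the textual subspace; how the two are weighted and merged is not specified in the abstract.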