Adversarial Learning For Cross-Modal Retrieval With Wasserstein Distance

Qingrong Cheng, Youcai Zhang, Xiaodong Gu
DOI: https://doi.org/10.1007/978-3-030-36708-4_2
2019-01-01
Abstract: This paper presents Adversarial Learning with Wasserstein Distance (ALWD), a novel approach to cross-modal retrieval that learns aligned representations for different modalities within a GAN framework. The generator projects image and text features into an aligned representation space, while the discriminator keeps the image and text features from drifting too far apart, in a way that maintains the semantic relations between the input samples. That is, ALWD reformulates cross-modal retrieval as an image-text domain adaptation problem whose goal is to reduce domain discrepancy. To learn domain-invariant representations, a domain critic network estimates the Wasserstein distance between the distributions of the different modalities, and the feature extractor network is optimized adversarially to minimize that distance. Meanwhile, ALWD introduces an additive margin softmax function so that the learned representations are also discriminative for label prediction. Furthermore, a structure preservation constraint is imposed to keep the local structure consistent during learning. Extensive comparison experiments on three widely used datasets demonstrate that ALWD outperforms state-of-the-art cross-modal retrieval methods.
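The two loss ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The linear critic, the helper names, and the hyperparameters `scale=30.0` and `margin=0.35` are assumptions chosen for illustration, and a real critic would also need a Lipschitz constraint (for example weight clipping or a gradient penalty), which the abstract does not specify.

```python
import math

def critic_score(weights, feature):
    """Hypothetical linear critic: a dot product standing in for a critic network."""
    return sum(w * x for w, x in zip(weights, feature))

def wasserstein_estimate(weights, image_feats, text_feats):
    """Empirical critic gap: mean critic score on image features minus
    mean critic score on text features. The critic would be trained to
    maximize this gap and the feature extractor to minimize it."""
    img = sum(critic_score(weights, f) for f in image_feats) / len(image_feats)
    txt = sum(critic_score(weights, f) for f in text_feats) / len(text_feats)
    return img - txt

def _normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def am_softmax_loss(feature, class_weights, label, scale=30.0, margin=0.35):
    """Additive margin softmax loss for one sample.
    The margin is subtracted from the cosine of the true class only,
    so the true class must win by at least that margin."""
    f = _normalize(feature)
    cosines = [sum(a * b for a, b in zip(f, _normalize(w))) for w in class_weights]
    logits = [scale * (c - margin) if j == label else scale * c
              for j, c in enumerate(cosines)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

# Toy features (assumed values): the two modalities sit in different regions.
image_feats = [[1.0, 0.0], [0.8, 0.2]]
text_feats = [[0.0, 1.0], [0.2, 0.8]]
critic_weights = [1.0, -1.0]
gap = wasserstein_estimate(critic_weights, image_feats, text_feats)
```

In this toy setup the gap is large when the image and text features occupy different regions and drops to zero when the two sets coincide, which is the quantity the feature extractor is trained to shrink. Setting `margin=0.0` in `am_softmax_loss` recovers an ordinary softmax loss over scaled cosine similarities.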