Abstract: Cross-modal retrieval tasks, which are more natural and more challenging than traditional retrieval tasks, have attracted increasing research interest in recent years. Although data from different modalities with the same semantics are potentially correlated, the heterogeneity of their feature spaces still seriously weakens the performance of cross-modal retrieval models. To address this, common space-based methods, in which multimodal data is projected into a learned common space for similarity measurement, have become the mainstream approach. However, current methods entangle modality style and semantic content in the common space, and they do not fully exploit semantically discriminative representation and reconstruction of the semantic content, which often leads to unsatisfactory retrieval performance. To solve these issues, this paper proposes a new Deep Supervised Dual Cycle Adversarial Network (DSDCAN) model based on common space learning. It is composed of two cross-modal cycle GANs, one for the image and one for the text. The proposed cycle GAN model disentangles semantic content features from modality style features by requiring that the data of one modality be well reconstructed from its own extracted modality style feature and the content feature of the other modality. Then, a discriminative semantic and label loss is proposed that jointly considers category information, sample contrast, and label supervision to enhance the semantic discrimination of the common space representation. In addition, to make the data distributions of the two modalities similar, a second-order similarity is introduced as a distance measure for the cross-modal representations in the common space. Extensive experiments have been conducted on the Wikipedia, Pascal Sentence, NUS-WIDE-10k, PKU XMedia, MSCOCO, NUS-WIDE, Flickr30k and MIRFlickr datasets. The results demonstrate that the proposed method outperforms state-of-the-art methods.
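
The common-space idea described in the abstract can be illustrated with a minimal sketch. It is not the DSDCAN model: the learned encoders are replaced here by fixed linear maps (`W_IMG`, `W_TXT`, both hypothetical), cosine similarity stands in for the similarity measure, and all function names and toy values are assumptions made for illustration only.

```python
import math


def project(matrix, vec):
    """Apply a linear map (given as a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_texts(image_feat, text_feats, w_img, w_txt):
    """Rank text items for an image query by similarity in the common space."""
    query = project(w_img, image_feat)
    scored = [(cosine(query, project(w_txt, t)), i)
              for i, t in enumerate(text_feats)]
    # Highest similarity first; return only the text indices.
    return [i for _, i in sorted(scored, key=lambda p: p[0], reverse=True)]


# Hypothetical projections: 3-d image features and 2-d text features
# are both mapped into a 2-d common space.
W_IMG = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]
W_TXT = [[0.0, 1.0],
         [1.0, 0.0]]

image = [1.0, 0.1, 5.0]
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ranking = rank_texts(image, texts, W_IMG, W_TXT)
print(ranking)
```

In this toy setup the image and text features live in spaces of different dimensionality and cannot be compared directly; only after projection does a single similarity function apply to both, which is the heterogeneity problem the abstract refers to.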

Semi-Supervised Coupled Dictionary Learning For Cross-Modal Retrieval In Internet Images And Texts

Incremental semi-supervised subspace learning for image retrieval

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Image retrieval based on incremental subspace learning

Joint Dictionary Learning and Semantic Constrained Latent Subspace Projection for Cross-Modal Retrieval

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Weakly-paired Deep Dictionary Learning for Cross-Modal Retrieval

UCSL: Toward Unsupervised Common Subspace Learning for Cross-Modal Image Classification

Supervised Coupled Dictionary Learning with Group Structures for Multi-modal Retrieval

Dual graph-structured semantics multi-subspace learning for cross-modal retrieval

Completely Unpaired Cross-Modal Hashing Based on Coupled Subspace

Discriminative Dictionary Learning with Common Label Alignment for Cross-Modal Retrieval

Deep Supervised Dual Cycle Adversarial Network for Cross-Modal Retrieval

A semi-supervised cross-modal memory bank for cross-modal retrieval

Self-supervised Correlation Learning for Cross-Modal Retrieval

Deep Supervised Cross-Modal Retrieval

Multimodal visual dictionary learning via heterogeneous latent semantic sparse coding

Effective Deep Learning-Based Multi-Modal Retrieval

Image-text matching using multi-subspace joint representation

Semantics Disentangling for Cross-Modal Retrieval

Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities