Abstract: Cross-modal retrieval tasks, which are more natural and more challenging than traditional retrieval tasks, have attracted increasing research interest in recent years. Although data from different modalities with the same semantics are potentially related, the heterogeneity of their feature spaces still seriously degrades the performance of cross-modal retrieval models. To address this problem, common-space methods, which project multimodal data into a learned common space for similarity measurement, have become the mainstream approach to cross-modal retrieval. However, current methods entangle modality style and semantic content in the common space and do not fully exploit the semantic and discriminative representation and reconstruction of the semantic content, which often results in unsatisfactory retrieval performance. To address these issues, this paper proposes a new Deep Supervised Dual Cycle Adversarial Network (DSDCAN) model based on common space learning. It is composed of two cross-modal cycle GANs, one for images and one for text. The proposed cycle GAN model disentangles the semantic content and modality style features by requiring that the data of one modality be well reconstructed from the extracted modality style feature and the content feature of the other modality. A discriminative semantic and label loss is then proposed that jointly considers category, sample contrast, and label supervision to enhance the semantic discrimination of the common-space representation. In addition, to make the data distributions of the two modalities similar, a second-order similarity is used as the distance measure for cross-modal representations in the common space. Extensive experiments were conducted on the Wikipedia, Pascal Sentence, NUS-WIDE-10k, PKU XMedia, MSCOCO, NUS-WIDE, Flickr30k, and MIRFlickr datasets. The results demonstrate that the proposed method achieves higher performance than state-of-the-art methods.
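
The general common-space idea mentioned in the abstract (project each modality into a shared space, then compare there) can be illustrated with a minimal sketch. This is not the DSDCAN model: the linear projections `w_query` and `w_gallery`, the feature dimensions, the random data, and the use of cosine similarity are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def rank_by_cosine(query_feats, gallery_feats, w_query, w_gallery):
    """Project both modalities into a common space and rank gallery items.

    w_query and w_gallery are hypothetical linear projections standing in
    for whatever learned encoders a real model would use.
    """
    q = l2_normalize(query_feats @ w_query)      # queries in the common space
    g = l2_normalize(gallery_feats @ w_gallery)  # gallery in the common space
    sims = q @ g.T                               # cosine similarity matrix
    ranking = np.argsort(-sims, axis=1)          # most similar item first
    return ranking, sims


# Toy data (assumed shapes): 4 image queries with 8-d features,
# 5 text items with 6-d features, and a 3-d common space.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(5, 6))
w_img = rng.normal(size=(8, 3))
w_txt = rng.normal(size=(6, 3))

ranking, sims = rank_by_cosine(img, txt, w_img, w_txt)
```

Each row of `ranking` lists the text indices for one image query, ordered from most to least similar. In a trained model the projections would be learned so that items with the same semantics land close together in the common space.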

Cross-modal retrieval by an end to end way

Semantic Consistency Hashing for Cross-Modal Retrieval

Semantic Modeling of Textual Relationships in Cross-modal Retrieval

Modality-Specific Cross-Modal Similarity Measurement With Recurrent Attention Network

Modality-dependent Cross-media Retrieval

Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities

Cross-Modality Matching Based On Fisher Vector With Neural Word Embeddings And Deep Image Features

Cross‐modal retrieval with dual multi‐angle self‐attention

Cross-Modal Image-Text Retrieval with Semantic Consistency

Cross Domain Search by Exploiting Wikipedia

Deep Supervised Dual Cycle Adversarial Network for Cross-Modal Retrieval

Deep Supervised Cross-Modal Retrieval

Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions

Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Multi-step Self-attention Network for Cross-modal Retrieval Based on a Limited Text Space

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Analyzing semantic correlation for cross-modal retrieval