Abstract: Recently, generative adversarial networks (GANs) have shown a strong ability to model data distributions via adversarial learning. Cross-modal GANs, which attempt to use the power of GANs to model the cross-modal joint distribution and to learn compatible cross-modal features, are becoming a research hotspot. However, existing cross-modal GAN approaches typically 1) require labeled multimodal data, which incurs massive labor cost, to establish cross-modal correlation; 2) use the vanilla GAN model, which results in an unstable training procedure and meaningless synthetic features; and 3) lack extensibility for retrieving cross-modal data of new classes. In this article, we revisit adversarial learning in existing cross-modal GAN methods and propose Joint Feature Synthesis and Embedding (JFSE), a novel method that jointly performs multimodal feature synthesis and common embedding space learning to overcome the above three shortcomings. Specifically, JFSE deploys two coupled conditional Wasserstein GAN modules for the input data of the two modalities, to synthesize meaningful and correlated multimodal features under the guidance of the word embeddings of class labels. Moreover, three advanced distribution alignment schemes with advanced cycle-consistency constraints are proposed to preserve semantic compatibility and enable knowledge transfer in the common embedding space for both the true and synthetic cross-modal features. All these add-ons in JFSE not only help to learn a more effective common embedding space that captures the cross-modal correlation but also facilitate knowledge transfer to multimodal data of new classes. Extensive experiments are conducted on four widely used cross-modal datasets, and comparisons with more than ten state-of-the-art approaches show that our JFSE method achieves remarkable accuracy improvements on both standard retrieval and the newly explored zero-shot and generalized zero-shot retrieval tasks.
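
For readers unfamiliar with the objectives named in the abstract, the following is a minimal, dependency-free sketch of the standard Wasserstein GAN critic and generator losses, plus a generic squared-error cycle-consistency penalty. It is not the JFSE implementation: the abstract does not give the exact loss formulations, so the function names (`critic_loss`, `generator_loss`, `cycle_consistency_loss`) and the choice of a squared L2 reconstruction term are assumptions made for illustration. The critic scores are assumed to come from a conditional critic that has already seen each feature together with its class-label word embedding.

```python
from statistics import fmean
from typing import Sequence

Vector = Sequence[float]


def critic_loss(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Standard WGAN critic loss: E[D(fake)] - E[D(real)], to be minimized.

    Assumption: each score was produced by a conditional critic that received
    a feature together with the word embedding of its class label.
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores: Sequence[float]) -> float:
    """Standard WGAN generator loss: -E[D(fake)], to be minimized."""
    return -fmean(fake_scores)


def cycle_consistency_loss(originals: Sequence[Vector],
                           reconstructions: Sequence[Vector]) -> float:
    """Mean squared L2 distance between each vector and its reconstruction.

    Assumption: a generic cycle-consistency penalty; the form used in JFSE
    may differ.
    """
    per_pair = [
        sum((a - b) ** 2 for a, b in zip(x, x_hat))
        for x, x_hat in zip(originals, reconstructions)
    ]
    return fmean(per_pair)


if __name__ == "__main__":
    # Toy critic scores for two real and two synthesized features.
    print(critic_loss([1.0, 3.0], [0.0, 1.0]))
    print(generator_loss([0.0, 1.0]))
    # One vector and its (imperfect) reconstruction after a round trip.
    print(cycle_consistency_loss([[1.0, 2.0]], [[1.0, 0.0]]))
```

In a full training loop these scalar losses would be computed on mini-batches and combined with a Lipschitz constraint on the critic (for example weight clipping or a gradient penalty), which is omitted here for brevity.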

Learning Controlled Semantic Embedding for Cross-Modal Retrieval

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Semantics Disentangling for Cross-Modal Retrieval

Semantic-enhanced discriminative embedding learning for cross-modal retrieval

Coordinated and specific autoencoder for cross-modal retrieval

Exploring Graph-Structured Semantics for Cross-Modal Retrieval

Weighted Graph-structured Semantics Constraint Network for Cross-Modal Retrieval

Deep Multigraph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

Cross-modal Semantic Autoencoder with Embedding Consensus

Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective

Cross‐modal Semantic Correlation Learning by Bi‐CNN Network

Towards Learning a Semantic-Consistent Subspace for Cross-Modal Retrieval

Semi-supervised Cross-Modal Retrieval with Graph-Based Semantic Alignment Network

Learning Discriminative Representations for Semantic Cross Media Retrieval

Semantic Consistent Adversarial Cross-Modal Retrieval Exploiting Semantic Similarity

Dual graph-structured semantics multi-subspace learning for cross-modal retrieval

Enhanced Isomorphic Semantic Representation For Cross-Media Retrieval

Joint Feature Synthesis and Embedding: Adversarial Cross-Modal Retrieval Revisited

Graph Embedding Learning for Cross-Modal Information Retrieval