Abstract: Cross-modal retrieval has drawn wide interest for retrieval across different modalities (such as text, image, video, audio, and 3-D model). However, existing methods based on deep neural networks often face the challenge of insufficient cross-modal training data, which limits training effectiveness and easily leads to overfitting. Transfer learning is usually adopted to relieve the problem of insufficient training data, but it mainly focuses on knowledge transfer from a large-scale single-modal source domain (such as ImageNet) to a single-modal target domain. In fact, such large-scale single-modal datasets also contain rich modal-independent semantic knowledge that can be shared across different modalities. Moreover, large-scale cross-modal datasets are very labor-intensive to collect and label, so it is important to fully exploit the knowledge in single-modal datasets to boost cross-modal retrieval. To achieve this goal, this paper proposes a modal-adversarial hybrid transfer network (MHTN), which aims to realize knowledge transfer from a single-modal source domain to a cross-modal target domain and to learn a cross-modal common representation. It is an end-to-end architecture with two subnetworks. First, a modal-sharing knowledge transfer subnetwork is proposed to jointly transfer knowledge from a single modality in the source domain to all modalities in the target domain with a star network structure, which distills modal-independent supplementary knowledge for promoting cross-modal common representation learning. Second, a modal-adversarial semantic learning subnetwork is proposed to construct an adversarial training mechanism between the common representation generator and the modality discriminator, making the common representation discriminative for semantics but indiscriminative for modalities, which enhances cross-modal semantic consistency during the transfer process. Comprehensive experiments on four widely used datasets show the effectiveness of MHTN.
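The adversarial mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the linear classifiers, the trade-off weight `lam`, and the toy data are all assumptions made for illustration. It only shows how the two opposing objectives might be computed: the modality discriminator minimizes a modality-classification loss, while the common representation generator minimizes a semantic loss and at the same time tries to increase the discriminator's loss.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (list of logits, int label)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def linear(x, weights):
    """Hypothetical linear classifier: weights[k] is the weight vector of output k."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]

def modal_adversarial_losses(reps, sem_labels, mod_labels, w_sem, w_mod, lam=1.0):
    """Return the semantic, discriminator, and generator losses for one batch.

    reps       : common representations (one list of floats per sample)
    sem_labels : semantic category of each sample
    mod_labels : modality of each sample (assumed here: 0 = image, 1 = text)
    lam        : assumed trade-off weight for the adversarial term
    """
    n = len(reps)
    sem_loss = sum(cross_entropy(linear(r, w_sem), y)
                   for r, y in zip(reps, sem_labels)) / n
    mod_loss = sum(cross_entropy(linear(r, w_mod), m)
                   for r, m in zip(reps, mod_labels)) / n
    return {
        "semantic": sem_loss,                    # minimized by the generator
        "discriminator": mod_loss,               # minimized by the discriminator
        "generator": sem_loss - lam * mod_loss,  # generator tries to confuse it
    }

# Toy batch: two categories, each with one image and one text sample.
reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
sem_labels = [0, 0, 1, 1]
mod_labels = [0, 1, 0, 1]
w_sem = [[2.0, -2.0], [-2.0, 2.0]]  # separates the two categories
w_mod = [[0.0, 0.0], [0.0, 0.0]]    # uninformed discriminator (chance level)

losses = modal_adversarial_losses(reps, sem_labels, mod_labels, w_sem, w_mod, lam=0.5)
print(losses)
```

In this toy batch the discriminator is at chance level, so its loss equals log 2, which is the state the generator aims for: representations that separate categories well but carry no usable modality signal. In a trained network the two objectives would be optimized alternately or jointly through backpropagation, which this sketch does not attempt to show.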

MHTN: Modal-Adversarial Hybrid Transfer Network for Cross-Modal Retrieval
