Abstract: With the growing volume of multimodal data, cross-modal retrieval has attracted increasing attention and become an active research topic. Most existing techniques convert multimodal data into a common representation space in which semantic similarity between samples can be measured across modalities. However, these approaches may suffer from three limitations: 1) they overcome the modality gap by introducing a loss in the common representation space, which may not be sufficient to eliminate the heterogeneity between modalities; 2) they treat labels as independent entities and ignore the relationships among labels, which is not conducive to establishing semantic connections across multimodal data; 3) they ignore the non-binary values of label similarity in multi-label scenarios, which may lead to inefficient alignment of representation similarity with label similarity. To tackle these problems, in this article we propose two models that learn discriminative and modality-invariant representations for cross-modal retrieval. First, dual generative adversarial networks are built to project multimodal data into a common representation space. Second, to model label dependencies and develop inter-dependent classifiers, we employ multi-hop graph neural networks (consisting of a Probabilistic GNN and an Iterative GNN), in which a layer aggregation mechanism is introduced to exploit the information propagated over different hops. Third, we propose a novel soft multi-label contrastive loss for cross-modal retrieval, with a soft positive sampling probability, which aligns the representation similarity with the label similarity. Additionally, to adapt to incomplete-modal learning, which can have wider applications, we propose a modal reconstruction mechanism to generate missing features. Extensive experiments on three widely used benchmark datasets, i.e., NUS-WIDE, MIRFlickr, and MS-COCO, show the superiority of our proposed method.
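
To make the idea of aligning representation similarity with non-binary label similarity more concrete, the following is a minimal sketch of one way a soft multi-label contrastive loss could be written. It is not the formulation used in the paper, which the abstract does not give. The function name, the use of cosine similarity between multi-hot label vectors, the row normalisation into soft positive probabilities, and the temperature value are all illustrative assumptions.

```python
import numpy as np


def soft_multilabel_contrastive_loss(img, txt, labels, temperature=0.1):
    """Hypothetical image-to-text soft contrastive loss (illustration only).

    img, txt : (n, d) arrays of paired image and text representations.
    labels   : (n, c) multi-hot label matrix shared by each pair.
    """
    # L2-normalise so that dot products are cosine similarities.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (n, n) representation similarity

    # Assumption: label similarity is the cosine between multi-hot label
    # vectors, so it takes non-binary values in [0, 1].
    lab = labels.astype(float)
    lab = lab / np.maximum(np.linalg.norm(lab, axis=1, keepdims=True), 1e-12)
    label_sim = lab @ lab.T

    # Assumption: soft positive probabilities are the row-normalised
    # label similarities.
    target = label_sim / np.maximum(label_sim.sum(axis=1, keepdims=True), 1e-12)

    # Numerically stable log-softmax over the text candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy between soft targets and predicted retrieval distribution.
    return float(-(target * log_prob).sum(axis=1).mean())
```

Under these assumptions, a text sample that shares more labels with the query image receives a larger target probability, so the loss is lower when representation similarity follows label similarity than when the two disagree. A symmetric text-to-image term could be added in the same way.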

Cross-modal Image-Text Retrieval with Multitask Learning

Multi-Modal Retrieval Via Deep Textual-Visual Correlation Learning

Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

Cross‐modal retrieval with dual multi‐angle self‐attention

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Cross-modal Deep Metric Learning with Multi-Task Regularization

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Multimodal Learning of Social Image Representation by Exploiting Social Relations

CMPD: Using Cross Memory Network With Pair Discrimination for Image-Text Retrieval

Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval

Dual-View Curricular Optimal Transport for Cross-Lingual Cross-Modal Retrieval

Integrating Multi-Label Contrastive Learning With Dual Adversarial Graph Neural Networks for Cross-Modal Retrieval

Deep Supervised Dual Cycle Adversarial Network for Cross-Modal Retrieval

Effective Deep Learning-Based Multi-Modal Retrieval

Cross-modal Image Retrieval with Deep Mutual Information Maximization

Deep Attentional Fine-Grained Similarity Network with Adversarial Learning for Cross-Modal Retrieval

Modality-dependent Cross-media Retrieval

A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing