Abstract: Exploiting the relationships among samples in cross-modal data plays a key role in cross-modal retrieval, but most existing methods extract correlation only from pairwise samples and ignore the relations among unpaired samples. Some graph regularization methods offer a reasonable paradigm for exploiting the correlation among multiple samples; however, limited by the traditional framework, their performance leaves much room for improvement. Moreover, although some existing DNN-based methods achieve excellent performance, their requirement for massive labeled data is a shortcoming. In this paper, we propose a novel semi-supervised method, named Semi-supervised Constrained Graph Convolutional Network (SCGCN), which adopts a graph convolutional network (GCN) to exploit the correlation among batch samples of data with different modalities. To reduce the requirement for labeled data, we design a two-stage training procedure: a deep supervised learning stage and an unsupervised learning stage. In the deep supervised learning stage, we integrate two DNN-based semantic encoding networks and a shared classifier into a Deep Cross-modal Semantic Encoding (DCSE) module, which is trained by supervised learning with labeled data. From the DCSE module, we learn a temporary modality-invariant space in which the semantic embeddings of samples from different modalities are modality-invariant, and we also learn a classifier that can generate predicted labels for the unlabeled data. In the unsupervised learning stage, to fully exploit the correlation in cross-modal data, we design a Constrained Graph Convolutional Network (CGCN) module, which utilizes the GCN to exploit the correlation and adopts both an intra-modal discriminative loss and an inter-modal pairwise similarity loss to ensure that the generated common representation is modality-invariant and semantically discriminative. We perform extensive experiments on four conventional datasets and a large-scale dataset to demonstrate the effectiveness of the proposed approach.
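The graph-convolution step over a batch of cross-modal samples can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say how the batch graph is built, so linking samples that share a (predicted) label is an assumption made here, and the names `normalized_adjacency` and `gcn_layer` are hypothetical. The two losses and the training loop are omitted.

```python
import numpy as np

def normalized_adjacency(labels):
    """Link samples that share a label, then normalize symmetrically.

    Assumption: edges come from shared (predicted) labels. Every sample
    matches itself, so self-loops are already present on the diagonal.
    """
    labels = np.asarray(labels)
    adj = (labels[:, None] == labels[None, :]).astype(float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # D^{-1/2} A D^{-1/2}
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_hat, features, weight):
    """One graph convolution: ReLU(A_hat @ X @ W)."""
    return np.maximum(a_hat @ features @ weight, 0.0)

rng = np.random.default_rng(0)

# Toy batch: 3 image embeddings and 3 text embeddings of dimension 8,
# standing in for the outputs of the two semantic encoding networks.
image_emb = rng.normal(size=(3, 8))
text_emb = rng.normal(size=(3, 8))
features = np.vstack([image_emb, text_emb])

# Labels (ground truth or classifier predictions) for the 6 samples.
labels = [0, 1, 2, 0, 1, 2]

a_hat = normalized_adjacency(labels)
weight = rng.normal(size=(8, 4))  # randomly initialized layer weights
common = gcn_layer(a_hat, features, weight)

print(common.shape)
```

In this toy batch each image shares its label with exactly one text sample, so the two rows of the normalized adjacency are identical and the layer maps both samples to the same vector. With real data the graph is denser and the output only pulls related samples closer.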

Fine-Grained Correlation Learning with Stacked Co-attention Networks for Cross-Modal Information Retrieval

Fine-Grained Cross-Modal Retrieval with Triple-Streamed Memory Fusion Transformer Encoder

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Learning Feature Embedding with Strong Neural Activations for Fine-Grained Retrieval

Adversarial Learning-Based Semantic Correlation Representation for Cross-Modal Retrieval

Stacked Cross Attention for Image-Text Matching

Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image–Text Retrieval

CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network

Deep Attentional Fine-Grained Similarity Network with Adversarial Learning for Cross-Modal Retrieval

Self-supervised Correlation Learning for Cross-Modal Retrieval

SCANET: Improving multimodal representation and fusion with sparse- and cross-attention for multimodal sentiment analysis

Attention-Sharing Correlation Learning For Cross-Media Retrieval

Semi-supervised constrained graph convolutional network for cross-modal retrieval

Weighted Graph-structured Semantics Constraint Network for Cross-Modal Retrieval

Similarity Graph-correlation Reconstruction Network for unsupervised cross-modal hashing

Effective Multi-Modal Retrieval Based on Stacked Auto-Encoders

Federated learning for supervised cross-modal retrieval

Semantically Supervised Maximal Correlation for Cross-Modal Retrieval

Learning Shared Semantic Space with Correlation Alignment for Cross-Modal Event Retrieval

Cross‐modal retrieval with dual multi‐angle self‐attention