Abstract: We study the cross-modal retrieval problem, which lies at the heart of visual-textual processing. Its major challenge is how to effectively learn a shared multi-modal feature space in which the discrepancies between semantically related pairs, such as images and texts, are minimized regardless of their modalities. Most current methods focus on reasoning about cross-modality semantic relations within individual image-text pairs to learn the common representation. However, they overlook more global, structural inter-pair knowledge within the dataset, i.e., the graph-structured semantics within each training batch. In this paper, we introduce a graph-based, semantic-constrained learning framework to comprehensively explore the intra- and inter-modality information for cross-modal retrieval. Our idea is to maximally explore the structures of labeled data in graph latent space, and to use them as semantic constraints that enforce feature embeddings from semantically matched (image-text) pairs to be more similar and those from mismatched pairs to be less similar. This yields a novel graph-constrained common embedding learning paradigm for cross-modal retrieval, which remains largely under-explored. Moreover, a GAN-based dual learning approach is used to further improve the discriminability and to model the joint distribution across different modalities. Our fully-equipped approach, called Graph-constrained Cross-modal Retrieval (GCR), is able to mine intrinsic structures of training data for model learning and to enable reliable cross-modal retrieval. We empirically demonstrate that GCR achieves higher accuracy than existing state-of-the-art approaches on the Wikipedia, NUS-WIDE-10K, PKU XMedia and Pascal Sentence datasets. Code is available at https://github.com/neoscheung/GCR.
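
The idea of using batch-level label structure as a semantic constraint on the embeddings can be illustrated with a minimal sketch. This is not the GCR objective itself, which the abstract does not specify; it only assumes a simple margin-based loss over a label-derived batch graph. The function names (`label_graph`, `cosine_sim`, `graph_constraint_loss`) and the `margin` parameter are hypothetical and chosen for illustration.

```python
import numpy as np

def label_graph(labels):
    """Batch adjacency: 1.0 where two samples share a class label, else 0.0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def graph_constraint_loss(img_emb, txt_emb, labels, margin=0.2):
    """Illustrative loss (an assumption, not the paper's formulation).

    Image-text pairs connected in the label graph are pulled toward
    similarity 1; unconnected pairs are penalized only when their
    similarity exceeds `margin`.
    """
    adj = label_graph(labels)
    sim = cosine_sim(img_emb, txt_emb)
    pos = adj * (1.0 - sim)
    neg = (1.0 - adj) * np.maximum(0.0, sim - margin)
    return float((pos + neg).mean())

# Toy batch: two samples of class 0 and one of class 1.
labels = [0, 0, 1]
img = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
txt_aligned = img.copy()          # text embeddings agree with the labels
txt_swapped = img[:, ::-1].copy()  # text embeddings contradict the labels

loss_aligned = graph_constraint_loss(img, txt_aligned, labels)
loss_swapped = graph_constraint_loss(img, txt_swapped, labels)
```

In this toy batch the aligned embeddings incur no penalty, while the swapped ones are penalized on both the matched and the mismatched pairs, which is the behavior the semantic constraint is meant to encourage.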

Online Cross-Modal Scene Retrieval by Binary Representation and Semantic Graph

Semantic Consistency Hashing for Cross-Modal Retrieval

Discrete Cross-Modal Hashing for Efficient Multimedia Retrieval

Semantic embedding based online cross-modal hashing method

HSGMP: Heterogeneous Scene Graph Message Passing for Cross-modal Retrieval

Learning Sufficient Scene Representation for Unsupervised Cross-Modal Retrieval

Clustering-Based Semi-Supervised Cross-Modal Retrieval Using Scene Graph

From Sparse to Dense: Semantic Graph Evolutionary Hashing for Unsupervised Cross-Modal Retrieval

Semantic-rebased cross-modal hashing for scalable unsupervised text-visual retrieval

Exploring Graph-Structured Semantics for Cross-Modal Retrieval

Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval

Semantic-Guided Hashing for Cross-Modal Retrieval

Semi-Supervised Graph Convolutional Hashing Network for Large-Scale Cross-Modal Retrieval

Semantic-consistent cross-modal hashing for large-scale image retrieval

Graph Convolutional Network Hashing for Cross-Modal Retrieval

Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

Efficient Cross-Modal Retrieval via Deep Binary Hashing and Quantization

Deep Multigraph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

Multi-Granularity Semantic Information Integration Graph for Cross-Modal Hash Retrieval

Cross-modal Hashing with Semantic Deep Embedding

Dual graph-structured semantics multi-subspace learning for cross-modal retrieval