Abstract: Cross-modal retrieval plays an important role in flexibly finding useful information across different modalities of data. Effectively measuring the similarity between different modalities is the key to cross-modal retrieval. Different modalities, such as image and text, have an imbalanced and complementary relationship: they contain unequal amounts of information when describing the same semantics. For example, images often contain details that cannot be conveyed by textual descriptions, and vice versa. Existing works based on deep neural networks (DNNs) mostly construct one common space for different modalities to find the latent alignments between them, which loses their exclusive modality-specific characteristics. Therefore, we propose a modality-specific cross-modal similarity measurement (MCSM) approach that constructs an independent semantic space for each modality and adopts an end-to-end framework to directly generate modality-specific cross-modal similarity without an explicit common representation. For each semantic space, modality-specific characteristics within one modality are fully exploited by a recurrent attention network, while the data of the other modality is projected into this space with attention-based joint embedding, which uses the learned attention weights to guide fine-grained cross-modal correlation learning and captures the imbalanced and complementary relationship between different modalities. Finally, the complementarity between the semantic spaces of different modalities is explored by adaptive fusion of the modality-specific cross-modal similarities to perform cross-modal retrieval. Experiments on the widely used Wikipedia, Pascal Sentence, and MS-COCO datasets, as well as our constructed large-scale XMediaNet dataset, verify the effectiveness of the proposed approach, which outperforms 9 state-of-the-art methods.
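
The final fusion step described in the abstract can be illustrated with a minimal sketch. It assumes each item already has one vector in the image semantic space and one in the text semantic space (in the paper these would come from the recurrent attention network and the attention-based joint embedding, which are not reproduced here). The function names, the dictionary keys, the use of cosine similarity, and the fixed-weight sum are all illustrative assumptions; the abstract does not specify the actual adaptive fusion rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse_similarities(sim_image_space, sim_text_space, weight=0.5):
    """Combine the two modality-specific similarities.

    ASSUMPTION: a fixed-weight sum stands in for the paper's adaptive fusion.
    """
    return weight * sim_image_space + (1.0 - weight) * sim_text_space


def rank_candidates(query, candidates, weight=0.5):
    """Return candidate indices sorted by fused similarity, best first.

    Each item is a dict with hypothetical keys "image_space" and "text_space",
    holding that item's vector in the respective semantic space.
    """
    scores = []
    for cand in candidates:
        s_img = cosine(query["image_space"], cand["image_space"])
        s_txt = cosine(query["text_space"], cand["text_space"])
        scores.append(fuse_similarities(s_img, s_txt, weight))
    # Stable sort: ties keep their original candidate order.
    return sorted(range(len(candidates)), key=lambda i: -scores[i])
```

Keeping the two similarities separate until the last step mirrors the idea of the abstract: each semantic space is scored on its own terms, and only the resulting scores are merged, so no shared representation is required.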

Metric Based On Multi-Order Spaces For Cross-Modal Retrieval

Discrete Cross-Modal Hashing for Efficient Multimedia Retrieval

Semantic Consistency Hashing for Cross-Modal Retrieval

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

A Metric Learning Method for Image-based 3D Shape Retrieval

Universal Weighting Metric Learning for Cross-Modal Retrieval

Metric networks for enhanced perception of non-local semantic information

Full-Space Local Topology Extraction for Cross-Modal Retrieval

Geometric Matching for Cross-Modal Retrieval

Unsupervised Multi-modal Hashing for Cross-Modal Retrieval

GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning

An Efficient Approach for Geo-Multimedia Cross-Modal Retrieval

Tri-space and Ranking Based Heterogeneous Similarity Measure for Cross-Media Retrieval

Dual graph-structured semantics multi-subspace learning for cross-modal retrieval

Cross-modal Metric Learning with Graph Embedding

Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

Adversarial Cross-Modal Retrieval

Modality-Specific Cross-Modal Similarity Measurement With Recurrent Attention Network

Joint Dictionary Learning and Semantic Constrained Latent Subspace Projection for Cross-Modal Retrieval

Cross-modal Deep Metric Learning with Multi-Task Regularization

On Metric Learning for Audio-Text Cross-Modal Retrieval