Abstract: With the rapid growth of multimedia content on the Internet, cross-media retrieval has become a key problem in both research and application. Cross-media retrieval returns results that share the semantics of the query but may be of different media types. For instance, given a query image of Moraine Lake, a cross-media retrieval system can retrieve not only images of Moraine Lake but also related content of other media types, such as text descriptions. Measuring content similarity between different media types is therefore a challenging problem. In this paper, we propose a novel cross-media similarity measure. It considers both intra-media and inter-media correlation, which existing works ignore. Intra-media correlation focuses on semantic category information within each media type, while inter-media correlation focuses on positive and negative correlations between different media types. Both are important, and their adaptive fusion lets them complement each other. To mine the intra-media correlation, we propose a heterogeneous similarity measure with nearest neighbors (HSNN). The heterogeneous similarity is obtained by computing the probability that two media objects belong to the same semantic category. To mine the inter-media correlation, we propose a cross-media correlation propagation (CMCP) approach that simultaneously handles positive and negative correlations between media objects of different media types, whereas existing works focus solely on positive correlation. Negative correlation is important because it provides effective exclusive information. The positive and negative correlations are modeled as must-link and cannot-link constraints, respectively. Furthermore, our approach is able to propagate the correlation between heterogeneous modalities. Finally, both HSNN and CMCP are flexible, so any traditional similarity measure can be incorporated.
For cross-media retrieval, an effective ranking model is then learned by fusing multiple similarity measures through AdaRank. Experimental results on two datasets show the effectiveness of the proposed approach compared with state-of-the-art methods.
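The abstract says the heterogeneous similarity is the probability that two media objects belong to the same semantic category. The sketch below illustrates one way such a probability could be computed. It is not the paper's exact formulation. It assumes that each object's category distribution is estimated by a vote among its k nearest labeled neighbors within its own media type, and that the two distributions are combined by a dot product. The function names, the Euclidean distance, the value of k, and the toy data are all hypothetical.

```python
import numpy as np


def knn_category_probs(x, train_feats, train_labels, n_categories, k=3):
    """Estimate P(category | x) by voting among the k nearest labeled
    neighbors of x within its own media type (assumed estimator)."""
    dists = np.linalg.norm(train_feats - x, axis=1)  # Euclidean distance (assumption)
    nearest = np.argsort(dists)[:k]
    counts = np.bincount(train_labels[nearest], minlength=n_categories)
    return counts / counts.sum()


def heterogeneous_similarity(x_a, x_b, feats_a, labels_a, feats_b, labels_b,
                             n_categories, k=3):
    """Probability that x_a and x_b share a category:
    sum over c of P(c | x_a) * P(c | x_b). The two feature spaces may differ."""
    p_a = knn_category_probs(x_a, feats_a, labels_a, n_categories, k)
    p_b = knn_category_probs(x_b, feats_b, labels_b, n_categories, k)
    return float(p_a @ p_b)


# Toy labeled data: 2-D "image" features and 3-D "text" features, two categories.
img_feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                      [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
img_labels = np.array([0, 0, 0, 1, 1, 1])
txt_feats = np.array([[1.0, 0.0, 0.0], [0.9, 0.0, 0.0], [1.0, 0.1, 0.0],
                      [0.0, 0.0, 1.0], [0.0, 0.1, 1.0], [0.0, 0.0, 0.9]])
txt_labels = np.array([0, 0, 0, 1, 1, 1])

query_img = np.array([0.05, 0.05])        # lies near the category-0 images
txt_same = np.array([0.95, 0.05, 0.0])    # lies near the category-0 texts
txt_diff = np.array([0.0, 0.05, 0.95])    # lies near the category-1 texts

sim_same = heterogeneous_similarity(query_img, txt_same, img_feats, img_labels,
                                    txt_feats, txt_labels, n_categories=2)
sim_diff = heterogeneous_similarity(query_img, txt_diff, img_feats, img_labels,
                                    txt_feats, txt_labels, n_categories=2)
print(sim_same, sim_diff)
```

Because each object is compared only with labeled neighbors of its own media type, the image and text features never need to share a common space. Only the category distributions are compared across media.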

Local Self-Attention on Fine-grained Cross-media Retrieval

Self-Attention based Fine-Grained Cross-Media Hybrid Network

Attention-Sharing Correlation Learning For Cross-Media Retrieval

Deep Attentional Fine-Grained Similarity Network with Adversarial Learning for Cross-Modal Retrieval

A New Benchmark and Approach for Fine-grained Cross-media Retrieval

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Coarse-to-fine dual-level attention for video-text cross modal retrieval

Cross-media Multi-level Alignment with Relation Attention Network

Learning Discriminative Representations for Semantic Cross Media Retrieval

Cross‐modal retrieval with dual multi‐angle self‐attention

Modality-Specific Cross-Modal Similarity Measurement With Recurrent Attention Network

Tri-space and Ranking Based Heterogeneous Similarity Measure for Cross-Media Retrieval

All the attention you need: Global-local, spatial-channel attention for image retrieval

Cross‐media search method based on complementary attention and generative adversarial network for social networks

Multi-step Self-attention Network for Cross-modal Retrieval Based on a Limited Text Space

Bridging the gap between visual and auditory feature spaces for cross-media retrieval

Modeling Localness for Self-Attention Networks

Cross-media Retrieval by Intra-Media and Inter-Media Correlation Mining

Cross-media Retrieval by Exploiting Fine-Grained Correlation at Entity Level

Semantic enhancement and multi-level alignment network for cross-modal retrieval