Abstract: Cross-modal retrieval aims to bridge the heterogeneity gap and capture semantic associations between multimedia data of different modalities. Image-text retrieval is a key task within cross-modal retrieval and has made great progress through global alignment between images and text, or local alignment between regions and words. However, three problems remain. Firstly, text usually contains words that carry no semantic meaning, and this redundant information interferes with local alignment between words and image regions. Secondly, existing attention mechanisms focus only on the visual features of image regions and ignore the spatial relationships between the detected objects in an image, such as relative position and size. This information is often critical for understanding the content of an image. Finally, words or image regions may have different semantics in different global contexts, so overall semantic matching should be considered in order to mine the deeper semantic information expressed by images and texts. To solve these problems, we propose the Semantic Enhancement and Multi-level Alignment Network (SEMAN) for cross-modal retrieval. Firstly, a multi-head self-attention mechanism is introduced after word embedding to filter out words without semantic meaning in text sentences. Secondly, an image position relation embedding is proposed, which modifies the self-attention weight matrix to incorporate spatial relationship information between image regions. Finally, we introduce a multi-level alignment matching module to capture complex correlations between images and text. Extensive experiments on two benchmark datasets, Flickr30K and MSCOCO, demonstrate the effectiveness of SEMAN, which achieves state-of-the-art performance.
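
The idea of modifying the self-attention weight matrix with spatial relationships can be illustrated with a minimal sketch. It is not the SEMAN formulation, which the abstract does not specify. It assumes that each region comes with a box given as center, width and height, that relative position and size are encoded as four log-ratio features, and that a learned vector projects these features to a scalar bias added to the attention scores before the softmax. All names (`relative_geometry`, `position_aware_self_attention`, `w_g`) are hypothetical.

```python
import numpy as np

def relative_geometry(boxes):
    """Pairwise relative position and size features (an assumed encoding).

    boxes: (n, 4) array of [x_center, y_center, width, height].
    Returns an (n, n, 4) array; entry [i, j] describes region j relative to i.
    """
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    eps = 1e-3  # avoids log(0) when two centers coincide
    dx = np.log(np.maximum(np.abs(x[:, None] - x[None, :]), eps) / w[:, None])
    dy = np.log(np.maximum(np.abs(y[:, None] - y[None, :]), eps) / h[:, None])
    dw = np.log(w[None, :] / w[:, None])  # relative width
    dh = np.log(h[None, :] / h[:, None])  # relative height
    return np.stack([dx, dy, dw, dh], axis=-1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def position_aware_self_attention(feats, boxes, w_q, w_k, w_v, w_g):
    """Single-head self-attention whose weights are shifted by a spatial bias.

    The additive bias is an assumption made for illustration only.
    """
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n, n) visual similarity
    geo_bias = relative_geometry(boxes) @ w_g  # (n, n) spatial term
    weights = softmax(scores + geo_bias, axis=-1)
    return weights @ v, weights

# Toy usage with random region features and boxes (synthetic data).
rng = np.random.default_rng(0)
n, d = 5, 8
feats = rng.normal(size=(n, d))
boxes = np.column_stack([
    rng.uniform(0.0, 1.0, n), rng.uniform(0.0, 1.0, n),  # centers
    rng.uniform(0.1, 0.5, n), rng.uniform(0.1, 0.5, n),  # widths, heights
])
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
w_g = rng.normal(size=4)
out, weights = position_aware_self_attention(feats, boxes, w_q, w_k, w_v, w_g)
print(out.shape)
```

In this sketch two regions with similar visual features can still receive different attention weights when their relative positions or sizes differ, which is the kind of information the abstract says visual-only attention ignores.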

Semantic enhancement and multi-level alignment network for cross-modal retrieval
