Abstract: Cross-modal retrieval aims to address the heterogeneity of, and the semantic associations between, multimedia data of different modalities. Image-text retrieval, a key task in cross-modal retrieval, has made great progress through global alignment between images and text, or local alignment between regions and words. However, three problems remain. First, text usually contains words that carry little semantic meaning, and this redundant information interferes with local alignment between text words and image regions. Second, existing attention mechanisms focus only on the visual features of image regions and ignore the spatial relationships between individual detected objects in an image, such as relative position and size. This information is often critical for understanding image content. Finally, text words or image regions may have different semantics in different global contexts, so overall semantic matching should be considered and the deeper semantic information expressed by images and texts should be mined. To solve these problems, we propose the Semantic Enhancement and Multi-level Alignment Network (SEMAN) for cross-modal retrieval. First, a multi-head self-attention mechanism is introduced after word embedding to filter out words without semantic meaning in text sentences. Second, an image position relation embedding is proposed, which modifies the self-attention weight matrix to incorporate spatial relationship information between image regions. Finally, we introduce a multi-level alignment matching module to capture complex correlations between images and text. Extensive experiments on two benchmark datasets, Flickr30K and MSCOCO, demonstrate the effectiveness of SEMAN, which achieves state-of-the-art performance.
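To make the second idea concrete, the following is a minimal sketch of how a self-attention weight matrix over image regions could be modified with spatial relationship information. It is not SEMAN's actual formulation, which the abstract does not specify. The assumptions here are all illustrative: boxes are given as (center x, center y, width, height), the relative position and size features are log-ratios, the spatial term is a learned linear bias (`geo_weights`) added to the attention logits, and queries and keys are the raw region features with no learned projections.

```python
import math

def relative_geometry(box_i, box_j, eps=1e-6):
    """Relative position/size features between two boxes (cx, cy, w, h).
    The log-ratio form is an illustrative assumption, not taken from the paper."""
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return [
        math.log(max(abs(xi - xj), eps) / wi),  # horizontal offset, scaled by width
        math.log(max(abs(yi - yj), eps) / hi),  # vertical offset, scaled by height
        math.log(wj / wi),                      # relative width
        math.log(hj / hi),                      # relative height
    ]

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def position_aware_attention(features, boxes, geo_weights):
    """Self-attention over regions with a spatial bias added to the logits.
    Queries and keys are the raw features (no learned projections) for brevity."""
    n, d = len(features), len(features[0])
    weights = []
    for i in range(n):
        logits = []
        for j in range(n):
            # Scaled dot-product similarity of the visual features.
            visual = sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(d)
            # Hypothetical linear spatial bias from the relative geometry.
            geo = relative_geometry(boxes[i], boxes[j])
            bias = sum(w * g for w, g in zip(geo_weights, geo))
            logits.append(visual + bias)
        weights.append(softmax(logits))
    # Each output region is the attention-weighted sum of all region features.
    attended = [
        [sum(weights[i][j] * features[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]
    return weights, attended

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
boxes = [(10.0, 10.0, 5.0, 5.0), (12.0, 10.0, 5.0, 5.0), (100.0, 100.0, 5.0, 5.0)]
w_plain, _ = position_aware_attention(features, boxes, [0.0, 0.0, 0.0, 0.0])
w_geo, attended = position_aware_attention(features, boxes, [-0.5, -0.5, 0.0, 0.0])
```

With all-zero `geo_weights` the function reduces to ordinary dot-product self-attention. With negative weights on the offset features, region 0 attends more to the nearby region 1 than to the distant region 2, even though region 2 is visually more similar.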

Deep compositional cross-modal learning to rank via local-global alignment

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation.

Cross-Modal Learning to Rank Via Latent Joint Representation

Cross-modal Deep Metric Learning with Multi-Task Regularization

Ranking with local regression and global alignment for cross media retrieval.

A Low Rank Structural Large Margin Method for Cross-Modal Ranking

Learning Multimodal Neural Network with Ranking Examples

Semantic Pre-alignment and Ranking Learning with Unified Framework for Cross-modal Retrieval

Deep Cross-modal Hashing Based on Semantic Consistent Ranking

Cross-media semantic representation via bi-directional learning to rank.

Simple to Complex Cross-modal Learning to Rank

Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

Cross-modal alignment with graph reasoning for image-text retrieval

Semantic enhancement and multi-level alignment network for cross-modal retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Learning Cross-Modal Aligned Representation with Graph Embedding

End-to-End Cross-Modality Retrieval with CCA Projections and Pairwise Ranking Loss

Deep Multi-Level Semantic Hashing for Cross-Modal Retrieval

Cross-Modal Joint Prediction and Alignment for Composed Query Image Retrieval

Category Alignment Adversarial Learning for Cross-modal Retrieval