Abstract: Cross-modal retrieval aims to address heterogeneity and cross-modal semantic associations between multimedia data of different modalities. Image-text retrieval is a key task in cross-modal retrieval and has made great progress through global alignment between images and text, or local alignment between regions and words. However, the task still faces three problems. Firstly, text data usually contains words without semantic meaning, and this redundant information interferes with local alignment between text words and image regions. Secondly, existing attention mechanisms focus only on the visual features of image regions, while ignoring the spatial relationships between individual detected objects in an image, such as relative position and size. This information is often critical for understanding the content of an image. Finally, text words or image regions may have different semantics in different global contexts, so overall semantic matching should be considered and the deeper semantic information expressed by images and texts should be mined. To solve these problems, we propose the Semantic Enhancement and Multi-level Alignment Network (SEMAN) for cross-modal retrieval. Firstly, a multi-head self-attention mechanism is introduced after word embedding to filter out words without semantic meaning in text sentences. Secondly, an image position relation embedding is proposed, which modifies the self-attention weight matrix to incorporate the spatial relationship information between image regions. Finally, we introduce a multi-level alignment matching module to capture complex correlations between images and text. Extensive experiments on two benchmark datasets, i.e., Flickr30K and MSCOCO, demonstrate the effectiveness of SEMAN, which achieves state-of-the-art performance.
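
The abstract does not give the formula used for the image position relation embedding. As a rough illustration of the general idea only, the following minimal sketch adds a geometric term to the self-attention scores between image regions before the softmax. Everything specific here is an assumption rather than SEMAN's actual design: the log-ratio features for relative position and size, the additive form of the bias, the fixed `geo_weights` vector (which a real model would presumably learn), and the function names themselves.

```python
import math


def box_geometry(box_a, box_b, eps=1e-3):
    """Relative position and size of box_b with respect to box_a.

    Boxes are (x_center, y_center, width, height). The log-ratio form is an
    assumption for illustration; it is not taken from the SEMAN paper.
    """
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return (
        math.log(max(abs(xb - xa), eps) / wa),  # horizontal offset
        math.log(max(abs(yb - ya), eps) / ha),  # vertical offset
        math.log(wb / wa),                      # relative width
        math.log(hb / ha),                      # relative height
    )


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_self_attention(features, boxes, geo_weights):
    """Self-attention over image regions with an additive geometric bias.

    features    : one feature vector (list of floats) per region
    boxes       : one (x_center, y_center, width, height) tuple per region
    geo_weights : hypothetical weights for the four geometry features
    Returns the attended features and the attention weight matrix.
    """
    n, d = len(features), len(features[0])
    attended, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            # Scaled dot-product score from visual features alone.
            visual = sum(a * b for a, b in zip(features[i], features[j]))
            visual /= math.sqrt(d)
            # Geometric bias from relative position and size.
            geo = sum(
                w * g
                for w, g in zip(geo_weights, box_geometry(boxes[i], boxes[j]))
            )
            scores.append(visual + geo)
        row = softmax(scores)
        weights.append(row)
        attended.append(
            [sum(row[j] * features[j][k] for j in range(n)) for k in range(d)]
        )
    return attended, weights


# Three regions with identical visual features: the first two are close
# together, the third is far away.
features = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
boxes = [(10.0, 10.0, 4.0, 4.0), (12.0, 10.0, 4.0, 4.0), (50.0, 50.0, 4.0, 4.0)]
attended, weights = spatial_self_attention(
    features, boxes, geo_weights=(-0.5, -0.5, 0.0, 0.0)
)
```

Because the three regions share the same visual features, any difference between the entries of a row of `weights` comes from the geometric term alone. With negative weights on the offset features, the first region attends more strongly to its near neighbour than to the distant region. Setting `geo_weights` to all zeros recovers plain scaled dot-product self-attention.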

Scene-text aware cross-modal retrieval based on semantic matching (ChinaMM2024)

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

StacMR: Scene-Text Aware Cross-Modal Retrieval

Visual Matching is Enough for Scene Text Retrieval

Visual and semantic guided scene text retrieval

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Multi-view and region reasoning semantic enhancement for image-text retrieval

Beyond visual semantics: Exploring the role of scene text in image understanding

HSGMP: Heterogeneous Scene Graph Message Passing for Cross-modal Retrieval

Cross-Modal Image-Text Retrieval with Semantic Consistency

CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition

Context-Aware Attention Network for Image-Text Retrieval

Multi-step Self-attention Network for Cross-modal Retrieval Based on a Limited Text Space

Short text matching model with multiway semantic interaction based on multi-granularity semantic embedding

Image-Text Matching with Multi-View Attention

Scene Graph Based Fusion Network For Image-Text Retrieval

Scene Text Detection via Holistic, Multi-Channel Prediction

Dual Semantic Relationship Attention Network for Image-Text Matching

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

A Text-Context-Aware CNN Network for Multi-oriented and Multi-language Scene Text Detection