Semantic enhancement and multi-level alignment network for cross-modal retrieval
Jia Chen, Hong Zhang
DOI: https://doi.org/10.1007/s11042-023-17956-5
IF: 2.577
2024-01-14
Multimedia Tools and Applications
Abstract: Cross-modal retrieval aims to address heterogeneity and cross-modal semantic associations between multimedia data of different modalities. Image-text retrieval, a key challenge in cross-modal retrieval, has made great progress through global alignment between images and text, or local alignment between regions and words. However, it still faces three problems. Firstly, text data usually contains words without semantic meaning, and this redundant information interferes with local alignment between text words and image regions. Secondly, existing attention mechanisms focus only on the visual features of image regions, while ignoring the spatial relationships between individual detected objects in an image, such as relative position and size. This information is often critical for understanding the content of an image. Finally, text words or image regions may have different semantics in different global contexts, so overall semantic matching should be considered and the deeper semantic information expressed by images and texts should be mined. To solve these problems, we propose a Semantic Enhancement and Multi-level Alignment Network (SEMAN) for cross-modal retrieval. Firstly, a multi-head self-attention mechanism is introduced after word embedding to filter out words without semantic meaning in text sentences. Secondly, an image position relation embedding is proposed, which modifies the self-attention weight matrix to incorporate spatial relationship information between image regions. Finally, we introduce a multi-level alignment matching module to capture complex correlations between images and text. Extensive experiments on two benchmark datasets, i.e., Flickr30K and MSCOCO, demonstrate the effectiveness of SEMAN, which achieves state-of-the-art performance.
computer science, information systems, theory & methods,engineering, electrical & electronic, software engineering
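The abstract's second contribution, modifying the self-attention weight matrix with spatial relationship information between image regions, can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation: the abstract does not give the formulation, so the log-ratio geometry features, the additive bias on the attention logits, and all names here (`relative_geometry`, `spatial_self_attention`, `w_g`, the toy boxes and random weights) are assumptions made for illustration only.

```python
import numpy as np


def relative_geometry(boxes, eps=1e-6):
    """Pairwise relative position and size features (assumed form).

    boxes: array of shape (n, 4) with rows (x1, y1, x2, y2).
    Returns an array of shape (n, n, 4).
    """
    boxes = np.asarray(boxes, dtype=float)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + 0.5 * w
    cy = boxes[:, 1] + 0.5 * h
    # Entry [i, j] describes region j relative to region i.
    dx = np.log(np.abs(cx[None, :] - cx[:, None]) / w[:, None] + eps)
    dy = np.log(np.abs(cy[None, :] - cy[:, None]) / h[:, None] + eps)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_self_attention(feats, boxes, w_q, w_k, w_v, w_g):
    """Single-head self-attention whose logits are shifted by a geometry bias.

    w_g (hypothetical) projects the 4 geometry features to one scalar per
    region pair; that scalar is added to the scaled dot-product logits.
    """
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    logits = q @ k.T / np.sqrt(q.shape[-1])
    bias = relative_geometry(boxes) @ w_g  # (n, n, 4) @ (4,) -> (n, n)
    weights = softmax(logits + bias, axis=-1)
    return weights @ v, weights


# Toy inputs: 3 regions with 8-dimensional features and random weights.
rng = np.random.default_rng(0)
n, d = 3, 8
feats = rng.normal(size=(n, d))
boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 25], [30, 10, 50, 40]], dtype=float)
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
w_g = rng.normal(size=4) * 0.1
out, attn = spatial_self_attention(feats, boxes, w_q, w_k, w_v, w_g)
```

Here `attn[i, j]` depends on both the visual similarity of regions `i` and `j` and their relative position and size. Adding the bias before the softmax is one common way to inject geometry; the paper may combine the two signals differently.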