Abstract: Cross-modal retrieval aims to address the heterogeneity of, and the semantic associations between, multimedia data of different modalities. Image-text retrieval, a key task in cross-modal retrieval, has made great progress through global alignment between images and text or local alignment between regions and words. However, three problems remain. Firstly, text data usually contains words without semantic meaning, and this redundant information interferes with local alignment between text words and image regions. Secondly, existing attention mechanisms focus only on the visual features of image regions and ignore the spatial relationships between individual detected objects in an image, such as relative position and size. This information is often critical for understanding the content of an image. Finally, text words or image regions may have different semantics in different global contexts, so overall semantic matching should be considered and the deeper semantic information expressed by images and texts should be mined. To solve these problems, we propose the Semantic Enhancement and Multi-level Alignment Network (SEMAN) for cross-modal retrieval. Firstly, a multi-head self-attention mechanism is introduced after word embedding to filter out words without semantic meaning in text sentences. Secondly, an image position relation embedding is proposed, which modifies the self-attention weight matrix to incorporate spatial relationship information between image regions. Finally, we introduce a multi-level alignment matching module to capture complex correlations between images and text. Extensive experiments on two benchmark datasets, Flickr30K and MSCOCO, demonstrate the effectiveness of SEMAN, which achieves state-of-the-art performance.
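
The abstract does not give the exact form of the position relation embedding, so the following is only a minimal sketch of the general idea: pairwise geometry between detected regions (relative position and size) is turned into an additive bias on the self-attention scores before the softmax. The function names, the (x_center, y_center, width, height) box format, the log-scaled geometry features, and the projection vector `w_g` are all illustrative assumptions, not the formulation used in SEMAN.

```python
import numpy as np

def relative_geometry(boxes, eps=1e-3):
    """Pairwise geometry features (assumed form, not taken from the paper).

    boxes: (n, 4) array of (x_center, y_center, width, height).
    Returns an (n, n, 4) array of log-scaled offsets and size ratios.
    """
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    dx = np.log(np.maximum(np.abs(x[:, None] - x[None, :]) / w[:, None], eps))
    dy = np.log(np.maximum(np.abs(y[:, None] - y[None, :]) / h[:, None], eps))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def position_aware_self_attention(feats, boxes, w_q, w_k, w_v, w_g):
    """Single-head self-attention over region features with a geometric bias.

    feats: (n, d) region features; w_q, w_k, w_v: (d, d) projections;
    w_g: (4,) hypothetical projection of geometry features to a scalar bias.
    """
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (n, n) visual attention scores
    geo_bias = relative_geometry(boxes) @ w_g    # (n, n) spatial bias
    weights = softmax(scores + geo_bias, axis=-1)
    return weights @ v, weights

# Toy usage with random region features and boxes.
rng = np.random.default_rng(0)
n, d = 5, 8
feats = rng.normal(size=(n, d))
boxes = np.column_stack([rng.uniform(0.0, 1.0, (n, 2)),
                         rng.uniform(0.1, 0.5, (n, 2))])
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
w_g = rng.normal(size=4)
out, weights = position_aware_self_attention(feats, boxes, w_q, w_k, w_v, w_g)
```

Adding the bias before the softmax keeps each row of the weight matrix a valid distribution over regions, while letting spatially related regions attend to each other more or less strongly than their visual features alone would suggest.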

Multi-level multilingual semantic alignment for zero-shot cross-lingual transfer learning

Attention-based Cross-Layer Domain Alignment for Unsupervised Domain Adaptation

Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment

Learning Semantic Alignment Using Global Features and Multi-scale Confidence

Unsupervised Deep Cross-Language Entity Alignment

Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment

Adaptive multi-scale semantic fusion network for zero-shot learning

Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding

Exploring the Relationship between Alignment and Cross-lingual Transfer in Multilingual Transformers

PreAlign: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment

Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision

Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning

A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning

Probing the Emergence of Cross-lingual Alignment during LLM Training

Explicit Alignment Objectives for Multilingual Bidirectional Encoders

Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New Approach Using XLM-RoBERTa Alignment

Multi-level Fusion of Multi-modal Semantic Embeddings for Zero Shot Learning

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment