Abstract: Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach uses machine translation (MT) to create pseudo-parallel data pairs, establishing correspondence between visual and non-English textual data. However, aligning their representations is challenging because of the significant semantic gap between vision and text, as well as the lower quality of non-English representations caused by pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates a multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations. Specifically, we first employ the MLLM to generate detailed visual content descriptions and aggregate them into multi-view semantic slots that encapsulate different semantics. Then, we take these semantic slots as internal features and let them interact with the visual features. Doing so enhances the semantic information within the visual features, narrowing the semantic gap between modalities and generating local visual semantics for subsequent multi-level matching. Additionally, to further enhance the alignment between visual and non-English features, we introduce softened matching under English guidance. This approach provides more comprehensive and reliable inter-modal correspondences between visual and non-English features. Extensive experiments on four CCR benchmarks, i.e., Multi30K, MSCOCO, VATEX, and MSR-VTT-CN, demonstrate the effectiveness of our proposed method. Code: https://github.com/LiJiaBei-7/leccr.
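
The abstract does not spell out how softened matching under English guidance is computed. One plausible reading, offered here only as a minimal sketch and not as the authors' implementation, is a distillation-style objective: the English-to-visual similarity distribution serves as a soft target for the non-English-to-visual distribution. The function names, the KL-divergence form, and the `temperature` parameter below are all assumptions made for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(text_feats, visual_feats):
    """Row i holds the similarities of text query i to every visual item."""
    return [[dot(t, v) for v in visual_feats] for t in text_feats]


def softened_matching_loss(english_feats, target_feats, visual_feats,
                           temperature=0.1):
    """Hypothetical soft-label loss (assumption, not from the paper).

    For each query, the English-to-visual distribution p acts as the soft
    target and the non-English-to-visual distribution q as the prediction.
    Returns KL(p || q) averaged over queries.
    """
    en_sims = similarity_matrix(english_feats, visual_feats)
    tg_sims = similarity_matrix(target_feats, visual_feats)
    loss = 0.0
    for en_row, tg_row in zip(en_sims, tg_sims):
        p = softmax(en_row, temperature)
        q = softmax(tg_row, temperature)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss / len(en_sims)


# Toy features: two visual items, two English queries, two noisy translations.
visual = [[1.0, 0.0], [0.0, 1.0]]
english = [[1.0, 0.0], [0.0, 1.0]]
noisy_target = [[0.6, 0.8], [0.8, 0.6]]

loss_aligned = softened_matching_loss(english, english, visual)
loss_noisy = softened_matching_loss(english, noisy_target, visual)
```

In this toy setup the loss vanishes when the non-English features reproduce the English similarity pattern and grows when the translated queries drift toward the wrong visual item, which is the behavior a soft-label alignment objective is meant to penalize.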

Aligning Multilingual Word Embeddings for Cross-Modal Retrieval Task

Order embeddings and character-level convolutions for multimodal alignment

Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

MURAL: Multimodal, Multitask Retrieval Across Languages

Exploring Alignment in Shared Cross-lingual Spaces

Unsupervised Hyperalignment for Multilingual Word Embeddings

Learning Cross-Modal Aligned Representation with Graph Embedding

Jointly Learning Bilingual Word Embeddings and Alignments

Enhancing Separate Encoding with Multi-layer Feature Alignment for Image-Text Matching

Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment

Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Towards Cross-Modal Text-Molecule Retrieval with Better Modality Alignment

Learning Relation Alignment for Calibrated Cross-modal Retrieval

Cross-modal retrieval with dual multi-angle self-attention

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval