Abstract: Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach uses machine translation (MT) to create pseudo-parallel data pairs, establishing correspondence between visual and non-English textual data. However, aligning their representations is challenging because of the significant semantic gap between vision and text, and because non-English representations are of lower quality owing to the pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates a multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations. Specifically, we first employ the MLLM to generate detailed visual content descriptions and aggregate them into multi-view semantic slots that encapsulate different semantics. Then, we take these semantic slots as internal features and let them interact with the visual features. Doing so enhances the semantic information within the visual features, narrowing the semantic gap between modalities and generating local visual semantics for subsequent multi-level matching. Additionally, to further enhance the alignment between visual and non-English features, we introduce softened matching under English guidance, which provides more comprehensive and reliable inter-modal correspondences between visual and non-English features. Extensive experiments on four CCR benchmarks, i.e., Multi30K, MSCOCO, VATEX, and MSR-VTT-CN, demonstrate the effectiveness of our proposed method. Code: https://github.com/LiJiaBei-7/leccr
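
The abstract does not spell out how softened matching under English guidance is computed. The sketch below illustrates one plausible reading, and it is an assumption rather than the LECCR implementation: the English-to-visual similarity distribution serves as a soft target, and the non-English-to-visual distribution is pulled toward it with a KL divergence. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def log_softmax(scores, temperature):
    """Numerically stable log-softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_z for s in scaled]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softened_matching_loss(visual, english, non_english, temperature=0.1):
    """Mean KL(teacher || student) over a batch of captions.

    teacher: softmax over English-to-visual similarities (the soft target)
    student: softmax over non-English-to-visual similarities
    NOTE: hypothetical formulation, assumed for illustration only.
    """
    total = 0.0
    for en_vec, xx_vec in zip(english, non_english):
        log_p = log_softmax([dot(en_vec, v) for v in visual], temperature)
        log_q = log_softmax([dot(xx_vec, v) for v in visual], temperature)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(english)


# Toy unit-norm features: two visual items and their paired captions.
visual = [[1.0, 0.0], [0.0, 1.0]]
english = [[1.0, 0.0], [0.0, 1.0]]
noisy_non_english = [[0.8, 0.6], [0.6, 0.8]]  # stands in for MT noise

aligned_loss = softened_matching_loss(visual, english, english)
noisy_loss = softened_matching_loss(visual, english, noisy_non_english)
```

In this toy setup the loss vanishes when the non-English features coincide with the English ones and grows as they drift away from the English-guided distribution. Unlike a hard one-to-one matching target, the soft target also carries graded similarity to the non-paired visual items.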

A Cross-Lingual Sentence Pair Interaction Feature Capture Model Based on Pseudo-Corpus and Multilingual Embedding

MTLAN: Multi-Task Learning and Auxiliary Network for Enhanced Sentence Embedding

A Cross-Lingual Sentence Similarity Calculation Method with Multifeature Fusion

Improving Word Embeddings via Combining with Complementary Languages

Unsupervised Cross-Lingual Sentence Representation Learning via Linguistic Isomorphism

Cross-lingual Feature Extraction from Monolingual Corpora for Low-resource Unsupervised Bilingual Lexicon Induction

Exploiting Common Characters in Chinese and Japanese to Learn Cross-Lingual Word Embeddings Via Matrix Factorization

Learning Tibetan-Chinese Cross-Lingual Word Embeddings

Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment

Modelling Interaction of Sentence Pair with Coupled-LSTMs

A Cross-lingual Sentiment Embedding Model with Semantic and Sentiment Joint Learning

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Learning Multilingual Sentence Embeddings From Monolingual Corpus

Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment

Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding

Exploring Multilingual Syntactic Sentence Representations

Short text matching model with multiway semantic interaction based on multi-granularity semantic embedding

Integrating Word Embeddings and Traditional NLP Features to Measure Textual Entailment and Semantic Relatedness of Sentence Pairs

Learning Cross-lingual Word Embeddings Via Matrix Co-factorization

Enhancing Multilingual Universal Sentence Embeddings by Monolingual Contrastive Learning

CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval