Abstract: Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach uses machine translation (MT) to create pseudo-parallel data pairs, establishing correspondence between visual and non-English textual data. However, aligning their representations is challenging because of the significant semantic gap between vision and text, as well as the lower quality of non-English representations caused by pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates a multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations. Specifically, we first employ the MLLM to generate detailed visual content descriptions and aggregate them into multi-view semantic slots that encapsulate different semantics. Then, we take these semantic slots as internal features and let them interact with the visual features. Doing so enriches the semantic information within the visual features, narrows the semantic gap between modalities, and generates local visual semantics for subsequent multi-level matching. Additionally, to further enhance the alignment between visual and non-English features, we introduce softened matching under English guidance, which provides more comprehensive and reliable inter-modal correspondences between visual and non-English features. Extensive experiments on four CCR benchmarks, i.e., Multi30K, MSCOCO, VATEX, and MSR-VTT-CN, demonstrate the effectiveness of our proposed method. Code: https://github.com/LiJiaBei-7/leccr
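The abstract does not spell out how softened matching under English guidance is computed. One plausible reading, given here only as a minimal sketch and not as the authors' formulation, is that the English-to-visual similarity distribution acts as a soft target for the non-English-to-visual distribution, with the two compared by KL divergence. The function names, the temperature `tau`, and the toy vectors below are all assumptions made for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def similarity_rows(queries, visuals, tau):
    """Temperature-scaled cosine similarity of each query to every visual item."""
    qs = [normalize(q) for q in queries]
    vs = [normalize(v) for v in visuals]
    return [[sum(a * b for a, b in zip(q, v)) / tau for v in vs] for q in qs]

def softened_matching_loss(visuals, english, non_english, tau=0.1):
    """Mean KL(teacher || student) over the batch (hypothetical formulation).

    teacher: softmax of English-to-visual similarities (the soft targets)
    student: softmax of non-English-to-visual similarities
    """
    teacher_rows = similarity_rows(english, visuals, tau)
    student_rows = similarity_rows(non_english, visuals, tau)
    total = 0.0
    for t_row, s_row in zip(teacher_rows, student_rows):
        log_p = log_softmax(t_row)
        log_q = log_softmax(s_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_rows)

# Toy 2-D embeddings (made up): three visual items and their paired captions.
visuals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
english = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
noisy_non_english = [[0.5, 0.5], [0.4, 0.6], [0.9, 0.2]]

loss_noisy = softened_matching_loss(visuals, english, noisy_non_english)
loss_aligned = softened_matching_loss(visuals, english, english)
print(f"noisy: {loss_noisy:.4f}, aligned: {loss_aligned:.4f}")
```

Under this assumed formulation, soft targets let a non-English caption receive partial credit for visual items that its English counterpart also finds relevant, rather than being matched against a single one-hot positive.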

Multi-View LSTM Language Model with Word-Synchronized Auxiliary Feature for LVCSR

Future Vector Enhanced LSTM Language Model for LVCSR

Multi-modal Auto-regressive Modeling via Visual Words

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

Multi-Stream Posterior Features and Combining Subspace GMMs for Low Resource LVCSR

Large-scale Language Model Rescoring on Long-form Data

NEWLSTM: an Optimized Long Short-Term Memory Language Model for Sequence Prediction

Major–Minor Long Short-Term Memory for Word-Level Language Model

Using Large Language Model for End-to-End Chinese ASR and NER

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Long-Short Range Context Neural Networks for Language Modeling

Augmenting Language Models with Long-Term Memory

Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data

Unified Generative and Discriminative Training for Multi-modal Large Language Models

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

Advancing Multi-talker ASR Performance with Large Language Models

Visual Information Assisted Mandarin Large Vocabulary Continuous Speech Recognition

Enhancing Large Language Model with Self-Controlled Memory Framework

Unified Lexical Representation for Interpretable Visual-Language Alignment

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study