Abstract: State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and effectively zero-shot transfers to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM, demonstrates competitive performance against existing state-of-the-art multilingual retrievers trained on more extensive datasets in various languages. Further analysis reveals that our modular approach is highly data-efficient, effectively adapts to out-of-distribution data, and significantly reduces energy consumption and carbon emissions. By demonstrating its proficiency in zero-shot scenarios, ColBERT-XM marks a shift towards more sustainable and inclusive retrieval systems, enabling effective information accessibility in numerous languages. We publicly release our code and models for the community.

Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer

Zero-Shot Cross-Lingual Transfer in Legal Domain Using Transformer Models

Pushing the Limits of Zero-shot End-to-End Speech Translation

Towards Zero-Shot Multimodal Machine Translation

Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages

Improving Zero-Shot Translation of Low-Resource Languages

Improving Zero-shot Translation with Language-Independent Constraints

Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies

Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

Language Tokens: A Frustratingly Simple Approach Improves Zero-Shot Performance of Multilingual Translation

Towards Zero-shot Cross-lingual Image Retrieval and Tagging

Subword Segmentation and a Single Bridge Language Affect Zero-Shot Neural Machine Translation

A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning

ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval

On the cross-lingual transferability of monolingual representations

Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training

Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models

An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models

ChatZero: Zero-shot Cross-Lingual Dialogue Generation via Pseudo-Target Language

Improving Zero-Shot Multilingual Translation with Universal Representations and Cross-Mappings