Abstract: In this paper, we present a new embedding model, called M3-Embedding, which is distinguished by its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It supports more than 100 working languages, leading to new state-of-the-art performance on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model to realize such strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
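To make the three retrieval functionalities named in the abstract concrete, the following is a minimal sketch of how each one typically scores a query against a document. The abstract does not give the model's exact scoring formulas, so the formulations here (cosine similarity for dense, term-weight overlap for sparse, averaged max-similarity for multi-vector) are common choices assumed for illustration, and all function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)


def dense_score(q_vec, d_vec):
    """Dense retrieval: one vector per text, compared by cosine similarity."""
    return dot(normalize(q_vec), normalize(d_vec))


def sparse_score(q_weights, d_weights):
    """Sparse retrieval: sum of weight products over terms shared by both texts.

    Each argument maps a term to a non-negative importance weight.
    """
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)


def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector retrieval: each query vector is matched to its most
    similar document vector, and the best matches are averaged.

    Assumes both lists are non-empty.
    """
    q = [normalize(v) for v in q_vecs]
    d = [normalize(v) for v in d_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q) / len(q)


if __name__ == "__main__":
    print(dense_score([1.0, 0.0], [2.0, 0.0]))
    print(sparse_score({"embedding": 0.5, "model": 0.2},
                       {"embedding": 0.4, "retrieval": 0.9}))
    print(multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]))
```

In a unified model, all three scores would be computed from the outputs of a single encoder, so a retrieval system can use any one of them or combine them.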

What problem does this paper attempt to address?

The paper attempts to address the following three main issues:

1. **Insufficient multilingual support**: Most existing text embedding models are optimized only for English, with relatively less support for other languages. This limits the application of these models in multilingual environments.
2. **Single retrieval functionality**: Existing embedding models typically support only one specific retrieval function (such as dense retrieval, sparse retrieval, or multi-vector retrieval), whereas actual information retrieval systems often require a combination of multiple retrieval methods.
3. **Limited ability to handle long documents**: Due to high training costs, most embedding models can only handle shorter inputs and cannot effectively process long documents (exceeding several thousand words).

To address the above challenges, the paper introduces the **M3-Embedding** model, which achieves breakthroughs in the following areas:

- **Multilinguality**: M3-Embedding can support over 100 working languages, achieving multilingual and cross-lingual retrieval by learning a common semantic space for different languages.
- **Multifunctionality**: M3-Embedding can generate diverse embeddings, supporting various retrieval functions such as dense retrieval, sparse retrieval, and multi-vector retrieval.
- **Multigranularity**: M3-Embedding can handle different input granularities, from short sentences to long documents up to 8,192 tokens.

The paper optimizes the quality of embeddings through a series of technical innovations, including:

- **Self-knowledge distillation framework**: Enhances the training process by integrating relevance scores from different retrieval functions as teacher signals.
- **Batch processing strategy optimization**: Achieves large batch sizes and high training throughput, improving the discriminative power of embeddings.
- **High-quality data curation**: Collects large-scale unsupervised data, annotated data, and synthetic data comprehensively to ensure the model's diversity and generalization ability.

Experimental results show that M3-Embedding performs excellently in multiple benchmarks, including multilingual retrieval, cross-lingual retrieval, and long document retrieval, achieving new state-of-the-art levels.
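The self-knowledge distillation idea can be sketched as follows: the relevance scores that the three retrieval functions assign to a list of candidate passages are combined into one teacher score per candidate, and each individual function is then trained to match the teacher's soft ranking. This summary does not give the exact loss, so the equal-weight sum, the softmax cross-entropy, and every function name below are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrated_scores(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Teacher signal: weighted sum of the three scores for each candidate.

    Equal weights are an assumption; the source does not specify them.
    """
    w_d, w_s, w_m = weights
    return [w_d * a + w_s * b + w_m * c for a, b, c in zip(dense, sparse, multi)]


def distillation_loss(student_scores, teacher_scores):
    """Cross-entropy between the teacher's and the student's soft rankings."""
    p_teacher = softmax(teacher_scores)
    p_student = softmax(student_scores)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


def self_distillation_loss(dense, sparse, multi):
    """Average distillation loss of the three functions against their own
    combined score. In real training the teacher would be treated as a
    constant (no gradient flows through it)."""
    teacher = integrated_scores(dense, sparse, multi)
    losses = [distillation_loss(s, teacher) for s in (dense, sparse, multi)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Scores for three candidate passages; the first is the relevant one.
    dense = [0.9, 0.2, 0.1]
    sparse = [0.3, 0.4, 0.0]
    multi = [0.8, 0.1, 0.2]
    print(integrated_scores(dense, sparse, multi))
    print(self_distillation_loss(dense, sparse, multi))
```

The intent of the design is that the combined score is usually a better ranking signal than any single function, so a function that ranks the candidates poorly on its own (here, the sparse scores) is pulled toward the ensemble's ranking.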

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Making Text Embedders Few-Shot Learners

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

Gecko: Versatile Text Embeddings Distilled from Large Language Models

Retrieve Anything To Augment Large Language Models

M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

Towards Robust Text Retrieval with Progressive Learning

Field Embedding: A Unified Grain-Based Framework for Word Representation

Multilingual E5 Text Embeddings: A Technical Report

M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

GUIM -- General User and Item Embedding with Mixture of Representation in E-commerce

MULE: Multimodal Universal Language Embedding

BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models

Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging

M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions

Language Models are Universal Embedders

Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free

Arctic-Embed 2.0: Multilingual Retrieval Without Compromise