BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu
2024-06-28
Abstract: In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model that realizes such strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the following three main issues:

1. **Insufficient multilingual support**: Most existing text embedding models are optimized only for English, with relatively less support for other languages. This limits the application of these models in multilingual environments.
2. **Single retrieval functionality**: Existing embedding models typically support only one specific retrieval function (such as dense retrieval, sparse retrieval, or multi-vector retrieval), whereas actual information retrieval systems often require a combination of multiple retrieval methods.
3. **Limited ability to handle long documents**: Due to high training costs, most embedding models can only handle shorter inputs and cannot effectively process long documents (exceeding several thousand words).

To address the above challenges, the paper introduces the **M3-Embedding** model, which achieves breakthroughs in the following areas:

- **Multilinguality**: M3-Embedding can support over 100 working languages, achieving multilingual and cross-lingual retrieval by learning a common semantic space for different languages.
- **Multifunctionality**: M3-Embedding can generate diverse embeddings, supporting various retrieval functions such as dense retrieval, sparse retrieval, and multi-vector retrieval.
- **Multigranularity**: M3-Embedding can handle different input granularities, from short sentences to long documents up to 8,192 tokens.

The paper optimizes the quality of embeddings through a series of technical innovations, including:

- **Self-knowledge distillation framework**: Enhances the training process by integrating relevance scores from different retrieval functions as teacher signals.
- **Batch processing strategy optimization**: Achieves large batch sizes and high training throughput, improving the discriminative power of embeddings.
- **High-quality data curation**: Combines large-scale unsupervised data, annotated data, and synthetic data to ensure the model's diversity and generalization ability.

Experimental results show that M3-Embedding performs strongly on multiple benchmarks, including multilingual retrieval, cross-lingual retrieval, and long document retrieval, achieving new state-of-the-art results.
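To make the three retrieval functions and the self-knowledge distillation idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that dense scores are inner products of single vectors, that sparse scores sum weight products over shared terms, that multi-vector scores use a late-interaction "max then mean" rule, that the teacher signal is a weighted sum of the three scores, and that the distillation loss is a softmax cross-entropy against that teacher. None of these specifics are spelled out in the summary above, and all function names and weights are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dense_score(q_vec, p_vec):
    """Dense retrieval: one vector per text, scored by inner product."""
    return dot(q_vec, p_vec)


def sparse_score(q_weights, p_weights):
    """Sparse retrieval: sum of weight products over terms present in both texts."""
    shared = q_weights.keys() & p_weights.keys()
    return sum(q_weights[t] * p_weights[t] for t in shared)


def multi_vector_score(q_vecs, p_vecs):
    """Multi-vector retrieval (assumed late interaction): for each query
    token vector take its best-matching passage token, then average."""
    return sum(max(dot(q, p) for p in p_vecs) for q in q_vecs) / len(q_vecs)


def softmax(scores):
    """Numerically stable softmax over a list of candidate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrated_scores(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Teacher signal (assumption): weighted sum of the three scores,
    computed per candidate passage. The weights are hypothetical."""
    w_d, w_s, w_m = weights
    return [w_d * d + w_s * s + w_m * m for d, s, m in zip(dense, sparse, multi)]


def distillation_loss(student_scores, teacher_scores):
    """Cross-entropy of the student's candidate distribution against the
    teacher's. Lowest when the two distributions agree."""
    teacher = softmax(teacher_scores)
    student = softmax(student_scores)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))
```

In this sketch, each retrieval function would act as a student in turn: its scores over a list of candidate passages are passed as `student_scores`, while `integrated_scores(...)` over the same candidates supplies `teacher_scores`. The combined score is used as the teacher on the assumption that the three functions capture complementary relevance evidence, so their combination is a better target than any single one.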