jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao
2024-09-19
Abstract: We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks. With a default output dimension of 1024, users can flexibly reduce the embedding dimensions to as low as 32 without compromising performance, enabled by Matryoshka Representation Learning.
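The dimension reduction mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual Matryoshka convention, in which a shorter embedding is obtained by keeping the leading components of the full vector and rescaling to unit length. The helper names `truncate_embedding` and `cosine` and the toy 8-dimensional vectors are illustrative assumptions, not part of the model's API or output.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 1 <= dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for 1024-dimensional model outputs (made-up values).
full_a = [0.9, 0.3, 0.1, 0.05, 0.02, 0.01, 0.01, 0.005]
full_b = [0.8, 0.4, 0.2, 0.01, 0.03, 0.02, 0.00, 0.010]

small_a = truncate_embedding(full_a, 4)
small_b = truncate_embedding(full_b, 4)

# Compare similarity at full size and after truncation to 4 dimensions.
print(cosine(full_a, full_b), cosine(small_a, small_b))
```

Renormalizing after truncation keeps dot products interpretable as cosine similarities, which is why the sketch rescales the shortened vector.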
Computation and Language, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Limitations of General Text Embedding Models**: Embedding models marketed as general-purpose often still require fine-tuning for specific tasks and suffer from common failure cases. Their performance is also inconsistent across tasks.
2. **Application Challenges of Large Language Models (LLMs)**: Large language models handle multilingual and multi-task scenarios well, but their massive parameter counts make practical deployment difficult. Their performance gains over encoder-only models are also marginal, which makes them less practical for many use cases.

To address these issues, the authors propose jina-embeddings-v3, a new text embedding model with 570 million parameters and the following characteristics:

- Supports multilingual data and long-context retrieval with inputs of up to 8192 tokens.
- Uses task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching.
- On the MTEB benchmark, it significantly outperforms its predecessor jina-embeddings-v2, surpasses the latest proprietary embedding models from OpenAI and Cohere on English tasks, and outperforms multilingual-e5-large-instruct across all multilingual tasks.

With these improvements, jina-embeddings-v3 improves performance in multilingual settings while remaining efficient and practical for production use.
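The idea of task-specific LoRA adapters can be sketched in a few lines. This is a toy illustration of the general LoRA formulation, in which a frozen weight matrix W is combined with a low-rank update scaled by alpha / r, so the layer output is W x + (alpha / r) B A x. It is not the model's actual implementation. The adapter keys `"retrieval"` and `"classification"`, the function names, and all numeric values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, base_weight, adapter, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x) for one linear layer."""
    A, B = adapter              # A: r x d_in, B: d_out x r
    rank = len(A)
    base = matvec(base_weight, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Shared frozen weight (2 x 3); made-up values.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

# One rank-1 adapter per task; keys and values are hypothetical.
ADAPTERS = {
    "retrieval": ([[1.0, 0.0, 1.0]], [[0.5], [0.0]]),
    # A zero B matrix leaves the base output unchanged, as at LoRA initialization.
    "classification": ([[0.0, 1.0, 0.0]], [[0.0], [0.0]]),
}

x = [1.0, 2.0, 3.0]
retrieval_out = lora_forward(x, W, ADAPTERS["retrieval"])
classification_out = lora_forward(x, W, ADAPTERS["classification"])
print(retrieval_out)       # [3.0, 2.0]
print(classification_out)  # [1.0, 2.0]
```

Only the small A and B matrices differ between tasks, so one shared backbone can serve several tasks by switching adapters at inference time.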