jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao

2024-09-19

Abstract: We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks. With a default output dimension of 1024, users can flexibly reduce the embedding dimensions to as low as 32 without compromising performance, enabled by Matryoshka Representation Learning.
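The dimension reduction mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual way Matryoshka-style embeddings are consumed: keep the leading components of the vector and rescale to unit length before comparing. The function names and the toy 8-dimensional vectors are hypothetical stand-ins for the model's 1024-dimensional outputs, and this is not code from the paper.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumption): real embeddings would come from the model.
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.02, -0.01, 0.03]
doc = [0.8, 0.2, 0.25, -0.1, -0.04, 0.01, 0.02, -0.02]

full_score = cosine(query, doc)
short_score = cosine(truncate_embedding(query, 4), truncate_embedding(doc, 4))
print(f"full: {full_score:.4f}  truncated to 4 dims: {short_score:.4f}")
```

In this toy case most of the signal sits in the leading components, so the truncated score stays close to the full one. Whether that holds for real embeddings depends on the model having been trained with a Matryoshka objective, as the abstract states for jina-embeddings-v3.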

Computation and Language, Artificial Intelligence, Information Retrieval

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Limitations of General Text Embedding Models**: Embedding models described as general-purpose often require fine-tuning to perform well on specific tasks and struggle with common failure cases. Despite their claims of generality, their performance across different tasks is inconsistent.
2. **Application Challenges of Large Language Models (LLMs)**: While large language models excel in multilingual and multi-task scenarios, their massive parameter counts make practical deployment difficult. Furthermore, their performance gains over encoder-only models are marginal, making them less practical for many use cases.

To address these issues, the authors propose jina-embeddings-v3, a new text embedding model with 570 million parameters and the following characteristics:

- Supports multilingual data and long-context retrieval tasks with inputs of up to 8192 tokens.
- Uses task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching.
- On the MTEB benchmark, significantly outperforms its predecessor jina-embeddings-v2, surpasses the latest proprietary embedding models from OpenAI and Cohere on English tasks, and outperforms multilingual-e5-large-instruct across all multilingual tasks.

With these improvements, jina-embeddings-v3 performs better in multilingual settings and is more efficient and practical to deploy in production.
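The task-specific LoRA adapters described above can be pictured with a small sketch of the general LoRA idea: a frozen weight matrix W is shared by all tasks, and each task adds its own low-rank update B·A on top. The class name, task keys, and toy matrices are hypothetical, and the summary does not say how the model selects or applies its adapters, so this shows only the technique in general.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """A frozen linear layer plus one low-rank (A, B) pair per task."""

    def __init__(self, weight, adapters, scale=1.0):
        self.weight = weight      # shared, frozen: d_out x d_in
        self.adapters = adapters  # task -> (A: r x d_in, B: d_out x r)
        self.scale = scale        # scaling applied to the low-rank update

    def forward(self, x, task):
        base = matvec(self.weight, x)
        a, b = self.adapters[task]
        delta = matvec(b, matvec(a, x))  # B (A x), rank at most r
        return [y + self.scale * d for y, d in zip(base, delta)]


# Toy layer with d_in = 3, d_out = 2, rank r = 1 (all values are assumptions).
layer = LoRALinear(
    weight=[[1, 0, 0], [0, 1, 0]],
    adapters={
        "retrieval": ([[0, 0, 1]], [[1], [0]]),
        "clustering": ([[0, 0, 1]], [[0], [1]]),
    },
)

x = [1, 2, 3]
print(layer.forward(x, "retrieval"))
print(layer.forward(x, "clustering"))
```

The same input produces different outputs depending on the chosen task, while only the small A and B matrices differ between tasks. That is what keeps per-task adaptation cheap relative to maintaining a separate full model for each task.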

Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents

Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

LongEmbed: Extending Embedding Models for Long Context Retrieval

Multilingual E5 Text Embeddings: A Technical Report

Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions

Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

HyperLoRA: Efficient Cross-task Generalization Via Constrained Low-Rank Adapters Generation

Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting

Making Text Embedders Few-Shot Learners

2D Matryoshka Sentence Embeddings

User-LLM: Efficient LLM Contextualization with User Embeddings

Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval

Improving Text Embeddings with Large Language Models

MURAL: Multimodal, Multitask Retrieval Across Languages

Multilingual Sentence-T5: Scalable Sentence Encoders for Multilingual Applications

MultiLoRA: Democratizing LoRA for Better Multi-Task Learning