Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

Michael Günther, Louis Milliken, Jonathan Geuter, Georgios Mastrapas, Bo Wang, Han Xiao

2023-10-20

Abstract: Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the model's awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.

Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Development of High-Quality Sentence Embedding Models**: The paper introduces Jina Embeddings, a set of high-performance sentence embedding models that transform text input into numerical representations capturing the semantics of the text. These models perform well in tasks such as dense retrieval and semantic textual similarity.
2. **Effectiveness of Data Preprocessing Strategies**: The study explores how data cleaning, language filtering, and consistency filtering improve the quality of the model training data.
3. **Choice of Loss Function**: The paper discusses which loss function is best suited to training sentence embedding models, in particular the use of contrastive loss.
4. **Impact of Parameter Scale on Performance**: The research analyzes how increasing the number of model parameters affects performance, with the aim of staying competitive while reducing the amount of training data required.
5. **Improved Handling of Negated Sentences**: To make the models more aware of grammatical negation, the researchers created a new training and evaluation dataset of negated and non-negated statements and made it publicly available.

Through these studies, the authors hope to demonstrate that high-performance sentence embedding models can be achieved even when trained on relatively small datasets. They also propose future directions for improvement, such as optimizing how sampling rates are selected and extending the models to bilingual settings.
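The summary names contrastive loss without giving its form. As a rough illustration only, the sketch below computes an InfoNCE-style loss with in-batch negatives, a common formulation for training on text pairs. The exact loss used by the authors, the temperature value, and the function names `cosine` and `info_nce_loss` are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean InfoNCE loss over a batch of embedding pairs.

    targets[i] is treated as the positive for queries[i]; every other
    target in the batch serves as a negative. The temperature is an
    assumed value chosen for illustration.
    """
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, target) / temperature for target in targets]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the matching target.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings": aligned pairs versus swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each query is closest to its own target and grows when another target in the batch is closer, which is the signal that pulls paired texts together in embedding space.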


Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Do Multi-Sense Embeddings Improve Natural Language Understanding?

GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

GenEOL: Harnessing the Generative Power of LLMs for Training-Free Sentence Embeddings

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

NewsEmbed: Modeling News through Pre-trained Document Representations

Making Text Embedders Few-Shot Learners

Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Jointly Modeling Embedding and Translation to Bridge Video and Language

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

2D Matryoshka Sentence Embeddings

Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting

GASE: Generatively Augmented Sentence Encoding

Improving Text Embeddings with Large Language Models

Benchmarking DNA Foundation Models for Genomic Sequence Classification

Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging
