Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

Michael Günther, Louis Milliken, Jonathan Geuter, Georgios Mastrapas, Bo Wang, Han Xiao
2023-10-20
Abstract: Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the model's awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.
Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Development of High-Quality Sentence Embedding Models**: The paper introduces Jina Embeddings, a set of high-performance sentence embedding models that transform text input into numerical representations capturing the semantics of the text. These models perform well in tasks such as dense retrieval and semantic textual similarity.
2. **Effectiveness of Data Preprocessing Strategies**: The study examines how different preprocessing strategies, including data cleaning, language filtering, and consistency filtering, affect the quality of the training data.
3. **Choice of Loss Function**: The paper discusses which loss function is best suited to training sentence embedding models, particularly the application of contrastive loss.
4. **Impact of Parameter Scale on Performance**: The research analyzes how increasing the number of model parameters affects performance, with the aim of staying competitive while reducing the amount of training data required.
5. **Improvement in Handling Negated Sentences**: To make the models more sensitive to grammatical negation, the researchers created a new training and evaluation dataset of negated and non-negated statements and made it publicly available to the community.

Through these studies, the authors hope to demonstrate that high-performance sentence embedding models can be achieved even when trained on relatively small datasets. They also propose future directions for improvement, such as optimizing how sampling rates are selected and extending the models to bilingual settings.
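To make the contrastive loss in point 3 more concrete, here is a minimal sketch of a generic in-batch InfoNCE-style loss over text pairs. The summary does not spell out the exact formulation used for Jina Embeddings, so the function name `info_nce_loss`, the temperature value, and the use of cosine similarity are all assumptions chosen for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean in-batch contrastive loss (hypothetical helper, not the paper's code).

    Row i of `targets` is treated as the positive for row i of `queries`;
    every other row in the batch serves as a negative.
    """
    q = [_normalize(v) for v in queries]
    t = [_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every target.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the matching target.
        total += log_denom - logits[i]
    return total / len(q)


# Correctly paired embeddings should score a much lower loss than mismatched ones.
batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(batch, batch)
swapped = info_nce_loss(batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss falls as each query embedding moves closer to its paired target and further from the other targets in the batch, which is the behaviour a contrastive objective is meant to encourage.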