Gecko: Versatile Text Embeddings Distilled from Large Language Models

Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, Iftekhar Naim
2024-03-30
Abstract: We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of the Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is how to use the knowledge of large language models (LLMs) to build a compact, versatile text embedding model that performs well across a range of natural language processing tasks. The paper proposes Gecko, a text embedding model that achieves this by distilling knowledge from LLMs.

### Main Problems and Solutions

1. **Challenges**:
   - Existing text embedding models usually require large amounts of training data to cover the required domains and skills.
   - Creating large-scale, high-quality annotated datasets is time-consuming and expensive, and can introduce data bias and a lack of diversity.
   - It is an open question how to use the knowledge of LLMs effectively to improve text embedding models.

2. **Solutions**:
   - **Two-step distillation process**:
     1. **Generate diverse synthetic data**: Use an LLM to generate diverse pairs of task descriptions and queries.
     2. **Refine data quality**: Retrieve candidate passages for each query, and use the same LLM to relabel the positive and hard negative passages.

3. **Innovations**:
   - Use an LLM to generate diverse, high-quality synthetic data, avoiding the high cost and potential bias of manual annotation.
   - Propose FRet (Few-shot Prompted Retrieval dataset), a dataset generated and ranked by an LLM, for training and optimizing text embedding models.
   - Gecko performs well on multiple benchmarks, notably given its compactness, and adapts to multiple tasks.

### Experimental Results

- **Massive Text Embedding Benchmark (MTEB)**:
  - Gecko with 256 embedding dimensions outperforms all existing MTEB entries with 768-dimensional embeddings, and Gecko sets new state-of-the-art results in classification, semantic similarity, and summarization tasks.
  - Even when trained only on the FRet dataset, which makes MTEB a zero-shot evaluation, Gecko still performs well.
- **Multilingual retrieval tasks**:
  - Although the FRet dataset contains only English content, the multilingual version of Gecko also performs well on multilingual retrieval tasks, outperforming the other baseline models.

### Summary

By introducing Gecko, the paper shows how to use the knowledge of large language models to create an efficient and versatile text embedding model. This method reduces the dependence on large-scale annotated data and improves the model's performance across a range of NLP tasks.
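
### Illustrative Sketch of the Two-Step Process

The two-step distillation process summarized above (generate a query for a passage, retrieve candidates, relabel positive and hard negative) can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: every function name here is hypothetical, the LLM and the retriever are replaced by simple token-overlap stand-ins so the code runs on its own, and picking the lowest-scored candidate as the hard negative is an assumption made for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], float]


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_generate_query(passage: str) -> Tuple[str, str]:
    """Stand-in for step 1 (hypothetical): an LLM would read the passage and
    write a task description and a query. Here: the first three longer words."""
    keywords = [w for w in tokens(passage) if len(w) > 4][:3]
    return "search query", " ".join(keywords)


def toy_retriever_score(query: str, passage: str) -> float:
    """Stand-in for an embedding retriever: share of query tokens in passage."""
    q, p = set(tokens(query)), set(tokens(passage))
    return len(q & p) / len(q) if q else 0.0


def toy_llm_score(query: str, passage: str) -> float:
    """Stand-in for an LLM relevance judgment: Jaccard overlap of token sets."""
    q, p = set(tokens(query)), set(tokens(passage))
    union = q | p
    return len(q & p) / len(union) if union else 0.0


def retrieve_candidates(query: str, corpus: List[str],
                        score: Scorer, k: int) -> List[str]:
    """Step 2a: keep the k passages the retriever scores highest."""
    ranked = sorted(corpus, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]


def relabel(query: str, candidates: List[str],
            llm_score: Scorer) -> Tuple[str, str]:
    """Step 2b: re-rank candidates with the LLM stand-in. The top one becomes
    the positive; the lowest one is used as the hard negative (assumption)."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to relabel")
    ranked = sorted(candidates, key=lambda p: llm_score(query, p), reverse=True)
    return ranked[0], ranked[-1]


def build_example(seed_passage: str, corpus: List[str], k: int = 3) -> Dict[str, str]:
    """Run both steps for one seed passage and return one training example."""
    task, query = toy_generate_query(seed_passage)
    candidates = retrieve_candidates(query, corpus, toy_retriever_score, k)
    positive, hard_negative = relabel(query, candidates, toy_llm_score)
    return {"task": task, "query": query,
            "positive": positive, "hard_negative": hard_negative}


corpus = [
    "Gecko is a compact text embedding model.",
    "Gecko lizards climb walls.",
    "Embedding models map text to vectors.",
    "Bananas are yellow.",
]
example = build_example(corpus[0], corpus, k=3)
print(example)
```

Because the positive is chosen from the retrieved candidates rather than fixed to the seed passage, it can differ from the passage the query was generated from. In a real pipeline the two scoring stand-ins would be replaced by an actual retriever and an actual LLM.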