Abstract: We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of the Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.

What problem does this paper attempt to address?

The paper asks how the knowledge of large language models (LLMs) can be used to build a compact, versatile text embedding model that performs well across a range of natural language processing tasks. It proposes Gecko, a text embedding model that achieves this by distilling knowledge from LLMs.

### Main Problems and Solutions

1. **Challenges**:
   - Existing text embedding models usually require a large amount of training data to cover the required domains and skills.
   - Creating large-scale, high-quality annotated datasets is time-consuming and expensive, and may lead to data bias and a lack of diversity.
   - It is unclear how to use the knowledge of LLMs effectively to improve text embedding models.
2. **Solutions**:
   - **Two-step distillation process**:
     1. **Generate diverse synthetic data**: Use an LLM to generate diverse task descriptions and queries.
     2. **Refine data quality**: Retrieve candidate passages for each query, and use the same LLM to relabel the positive and hard negative passages.
3. **Innovations**:
   - Use an LLM to generate diverse, high-quality synthetic data, avoiding the high cost and potential bias of manual data annotation.
   - Propose FRet (Few-shot Prompted Retrieval dataset), a dataset generated and ranked by an LLM, for training and optimizing text embedding models.
   - The Gecko model performs well on multiple benchmarks, particularly given its compactness and its adaptability across tasks.

### Experimental Results

- **Massive Text Embedding Benchmark (MTEB)**:
  - Gecko outperforms existing models of comparable size and embedding dimension on MTEB, and sets new state-of-the-art results on classification, semantic similarity, and summarization tasks.
  - Even when trained only on the FRet dataset (a zero-shot setting), Gecko still performs well.
- **Multilingual retrieval tasks**:
  - Although the FRet dataset contains only English content, the multilingual version of Gecko also performs well on cross-lingual retrieval tasks, outperforming the other baseline models.

### Summary

By introducing Gecko, the paper shows how the knowledge of large language models can be used to create an efficient and versatile text embedding model. The method reduces the dependence on large-scale annotated data and improves the model's performance across a variety of NLP tasks.
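The two-step distillation process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `generate`, `retrieve`, and `llm_score` are hypothetical placeholders for the LLM and retriever, seeding generation from a passage is an assumption, and taking the top-scored candidate as the positive and the lowest-scored one as the hard negative is one possible selection rule chosen for illustration. A word-overlap function stands in for the LLM so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TrainingExample:
    task: str
    query: str
    positive: str
    hard_negative: str


def distill_example(
    seed_passage: str,
    corpus: List[str],
    generate: Callable[[str], Tuple[str, str]],            # hypothetical LLM call: passage -> (task, query)
    retrieve: Callable[[str, List[str], int], List[str]],  # hypothetical retriever
    llm_score: Callable[[str, str], float],                # hypothetical LLM relevance score
    k: int = 3,
) -> TrainingExample:
    # Step 1: the LLM writes a synthetic task description and query.
    task, query = generate(seed_passage)
    # Step 2a: retrieve k candidate passages for the query.
    candidates = retrieve(query, corpus, k)
    # Step 2b: the same LLM re-scores the candidates; labels follow its ranking.
    # Assumption: best-scored = positive, worst-scored = hard negative.
    ranked = sorted(candidates, key=lambda p: llm_score(query, p), reverse=True)
    return TrainingExample(task, query, positive=ranked[0], hard_negative=ranked[-1])


# Toy stand-ins so the sketch runs without a real LLM or retriever.
def word_overlap(query: str, passage: str) -> float:
    q = set(query.lower().split())
    return float(len(q & set(passage.lower().split())))


def toy_generate(passage: str) -> Tuple[str, str]:
    # A real LLM would condition on the passage; this stub returns a fixed pair.
    return "question answering", "where do geckos live"


def toy_retrieve(query: str, corpus: List[str], k: int) -> List[str]:
    return sorted(corpus, key=lambda p: word_overlap(query, p), reverse=True)[:k]


corpus = [
    "geckos live in houses",
    "many geckos live in warm climates where insects are common",
    "penguins live in antarctica",
    "text embeddings map sentences to vectors",
]
example = distill_example(corpus[0], corpus, toy_generate, toy_retrieve, word_overlap)
print(example)
```

In this toy run the relabeled positive is the second passage, not the seed passage the query was generated from, which is the point of the refinement step: the passage that produced a query is not necessarily the best answer to it.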

Gecko: Versatile Text Embeddings Distilled from Large Language Models
