BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models

Kun Luo, Zheng Liu, Shitao Xiao, Kang Liu
2024-02-18
Abstract: Large language models (LLMs) call for extension of context to handle many critical applications. However, the existing approaches are prone to expensive costs and inferior quality of context extension. In this work, we propose Extensible Embedding, which realizes high-quality extension of LLM's context with strong flexibility and cost-effectiveness. Extensible embedding stands as an enhancement of typical token embedding, which represents the information for an extensible scope of context instead of a single token. By leveraging such compact input units of higher information density, the LLM can access a vast scope of context even with a small context window. Extensible embedding is systematically optimized in architecture and training method, which leads to multiple advantages. 1) High flexibility of context extension, which flexibly supports ad-hoc extension of diverse context lengths. 2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way. 3) Superior compatibility with the existing LLMs, where the extensible embedding can be seamlessly introduced as a plug-in component. Comprehensive evaluations on long-context language modeling and understanding tasks verify extensible embedding as an effective, efficient, flexible, and compatible method to extend the LLM's context.
Computation and Language
What problem does this paper attempt to address?
This paper addresses a weakness of existing retrieval-augmented methods for long-context language modeling: they rely on chunked context, which degrades the quality of semantic representations and leads to incomplete retrieval of useful information. Specifically, traditional retrieval-augmented methods first split a long text into multiple chunks, then encode and retrieve each chunk independently. This breaks the coherence of the context and can scatter a continuous piece of information across different chunks, so the retrieved information is incomplete.

To solve these problems, the paper proposes a new method, Landmark Embedding. Its main features are:

1. **Chunking-free architecture**: Special tokens called landmarks (LMK) are introduced so that the coherence of the long context is preserved, producing high-quality embeddings for fine-grained units such as sentences.
2. **Position-aware objective function**: The final boundary of a consecutive span of information is prioritized, so that useful information can be retrieved comprehensively.
3. **Multi-stage learning algorithm**: Different data sources and training strategies are used to train the Landmark Embedding model efficiently.

With these improvements, the paper aims to improve retrieval augmentation for long-context tasks, especially for texts that exceed the context window of existing large language models (LLMs). The experimental results show that Landmark Embedding significantly outperforms existing retrieval-augmented methods on a variety of long-context tasks.
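
The landmark and position-aware ideas can be illustrated with a minimal Python sketch. It is not the paper's implementation. The landmark string `<LMK>`, the regex sentence splitter, the exponential form of the weights, and the `alpha` parameter are all assumptions made for illustration; the summary only states that landmarks mark fine-grained units and that the final boundary of a consecutive span is prioritized.

```python
import math
import re

LMK = "<LMK>"  # hypothetical landmark token string (assumption, not from the source)


def insert_landmarks(text):
    """Split text into sentences and append a landmark token after each one.

    The whole marked text would then be encoded as one coherent sequence,
    with the embedding at each landmark position representing its sentence.
    Sentence splitting here is a naive regex, used only for illustration.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    marked = " ".join(f"{s} {LMK}" for s in sentences)
    return sentences, marked


def position_weights(num_sentences, alpha=1.0):
    """Return normalized weights for a span of consecutive relevant sentences.

    Assumed form: the weight decays exponentially with the distance from the
    last sentence of the span, so the final boundary gets the largest weight.
    """
    if num_sentences <= 0:
        raise ValueError("num_sentences must be positive")
    raw = [math.exp(-alpha * (num_sentences - 1 - i)) for i in range(num_sentences)]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    sentences, marked = insert_landmarks("A cat sat. It slept! Then it left?")
    print(marked)
    print(position_weights(len(sentences)))
```

In a training loop, weights like these could scale the per-sentence contrastive loss terms so that a query is pulled most strongly toward the landmark closing the relevant span. A larger `alpha` concentrates more of the weight on that final landmark.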