BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models

Kun Luo,Zheng Liu,Shitao Xiao,Kang Liu

2024-02-18

Abstract: Large language models (LLMs) call for extension of context to handle many critical applications. However, the existing approaches are prone to expensive costs and inferior quality of context extension. In this work, we propose Extensible Embedding, which realizes high-quality extension of the LLM's context with strong flexibility and cost-effectiveness. Extensible embedding stands as an enhancement of typical token embedding: it represents the information for an extensible scope of context instead of a single token. By leveraging such compact input units of higher information density, the LLM can access a vast scope of context even with a small context window. Extensible embedding is systematically optimized in architecture and training method, which leads to multiple advantages. 1) High flexibility of context extension, which supports ad-hoc extension to diverse context lengths. 2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way. 3) Superior compatibility with existing LLMs, where the extensible embedding can be seamlessly introduced as a plug-in component. Comprehensive evaluations on long-context language modeling and understanding tasks verify extensible embedding as an effective, efficient, flexible, and compatible method to extend the LLM's context.

Computation and Language

What problem does this paper attempt to address?

This paper addresses a weakness of existing retrieval-augmented methods for long-context language modeling: they rely on chunked context, which degrades the quality of semantic representations and leaves useful information incompletely retrieved. Specifically, when handling long texts, conventional retrieval-augmented methods first split the text into multiple chunks and then encode and retrieve each chunk separately. This breaks the coherence of the context and can split a continuous piece of information across different chunks, so the retrieved information is incomplete. To solve these problems, the paper proposes a new method, Landmark Embedding. Its main features are:

1. **Chunking-free architecture**: Special tokens (landmarks, LMK) are introduced so that the coherence of the long context is preserved, which yields high-quality embeddings for fine-grained units such as sentences.
2. **Position-aware objective function**: Priority is given to the final boundary of a run of consecutive relevant information, so that useful information can be retrieved comprehensively.
3. **Multi-stage learning algorithm**: Different data sources and training strategies are combined to train the Landmark Embedding model efficiently.

Through these improvements, the paper aims to make retrieval augmentation more effective in long-context tasks, especially for texts that exceed the context window of existing large language models (LLMs). Experimental results show that Landmark Embedding significantly outperforms existing retrieval-augmented methods across a variety of long-context tasks.
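The first two features can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the token string `<LMK>`, the helper names (`insert_landmarks`, `position_weights`, `top_k`), the parameter `alpha`, and the exponential form of the weights are all assumptions made for illustration. The encoder that would turn each landmark position into a vector is also left out, so the retrieval step runs on hand-written toy vectors.

```python
import math

LMK = "<LMK>"  # hypothetical landmark token string (assumption)


def insert_landmarks(sentences, lmk=LMK):
    """Append a landmark token to every sentence and join the result into
    one unchunked sequence, so each sentence gets its own embedding slot."""
    return " ".join(f"{s} {lmk}" for s in sentences)


def position_weights(span_len, alpha=1.0):
    """Normalized weights over a span of consecutive relevant sentences.
    Assumed form: w_i proportional to exp(-alpha * (span_len - 1 - i)),
    so the last sentence of the span receives the largest weight."""
    raw = [math.exp(-alpha * (span_len - 1 - i)) for i in range(span_len)]
    total = sum(raw)
    return [r / total for r in raw]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, landmark_vecs, k=2):
    """Indices of the k landmark embeddings most similar to the query."""
    scores = [cosine(query_vec, v) for v in landmark_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    print(insert_landmarks(["First sentence.", "Second sentence."]))
    print(position_weights(3))
    print(top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], k=2))
```

In a real system, the landmark vectors would come from the hidden states at the landmark positions of a trained encoder, and the weights would enter the training loss rather than being computed at query time.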
