Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization

Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang
2024-01-16
Abstract: Large language models (LLMs) need sufficient context to handle many critical applications, such as retrieval-augmented generation and few-shot learning. However, due to the constrained window size, LLMs can only access the information within a limited context. Although the context window can be extended by fine-tuning, doing so incurs a substantial cost in both the training and inference stages. In this paper, we present Extensible Tokenization as an alternative method that realizes flexible scaling of LLMs' context. Extensible Tokenization serves as middleware between the tokenized context and the LLM, transforming the raw token embeddings into extensible embeddings. Such embeddings provide a more compact representation of the long context, on top of which the LLM can perceive more information within the same context window. Extensible Tokenization is also characterized by its flexibility: the scaling factor can be flexibly determined within a feasible scope, allowing the context to be extended to an arbitrary length at inference time. Besides, Extensible Tokenization is introduced as a drop-in component, which can be seamlessly plugged into not only the LLM itself but also its fine-tuned derivatives, bringing in the extended contextual information while fully preserving the LLM's existing capabilities. We perform comprehensive experiments on long-context language modeling and understanding tasks, which verify Extensible Tokenization as an effective, efficient, flexible, and compatible method for extending LLMs' context. Our model and source code will be made publicly available.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to address the issue of context window limitations faced by large language models (LLMs) when handling long sequence data. Specifically:

1. **Context Window Limitations**:
   - Existing large language models are constrained by a fixed context window size when dealing with critical tasks such as retrieval-augmented generation and few-shot learning, which prevents them from fully covering the input data.
   - Although the context window can be extended through fine-tuning, this significantly increases the cost during training and inference stages and may compromise the model's original performance on shorter contexts.

2. **Limitations of Existing Methods**:
   - Sparse attention requires custom GPU kernels, which are not supported by standard infrastructure.
   - Stream processing ignores information beyond the context window, and memory compression leads to information loss and incompatibility with existing models.

3. **Proposed Method**:
   - The paper proposes Extensible Tokenization, a novel approach to extend the context capacity of LLMs without modifying the original model architecture.
   - Extensible Tokenization acts as a middleware, converting original token embeddings into compact representations called extensible embeddings, allowing the model to perceive more information within the same context window.
   - This method is highly flexible, strongly compatible, and can effectively enhance the performance of language modeling and understanding tasks in long contexts.

Through this approach, the paper aims to provide an efficient, flexible, and compatible way to extend the context processing capabilities of large language models.
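
The idea of fitting more context into the same window can be illustrated with a toy sketch. It assumes a scaling factor `k` that maps every `k` raw token embeddings to one compact embedding, so that a window of `W` positions covers roughly `W * k` raw tokens. The function names and the mean-pooling step are hypothetical placeholders chosen for illustration; the paper's actual module is learned and is not reproduced here, and the window size and factor in the usage lines are arbitrary example values.

```python
from typing import List

Vector = List[float]


def compress_embeddings(token_embeddings: List[Vector], scaling_factor: int) -> List[Vector]:
    """Toy stand-in for an extensible tokenizer (assumption, not the paper's method).

    Mean-pools every `scaling_factor` consecutive token embeddings into one
    compact embedding. The last chunk may be shorter than `scaling_factor`.
    """
    if scaling_factor < 1:
        raise ValueError("scaling_factor must be >= 1")
    compressed: List[Vector] = []
    for start in range(0, len(token_embeddings), scaling_factor):
        chunk = token_embeddings[start:start + scaling_factor]
        dim = len(chunk[0])
        # Average each embedding dimension over the chunk.
        compressed.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return compressed


def effective_context_length(window_size: int, scaling_factor: int) -> int:
    """Number of raw tokens covered when each window slot summarizes `scaling_factor` tokens."""
    return window_size * scaling_factor


if __name__ == "__main__":
    # Eight 2-dimensional raw embeddings, compressed with an illustrative factor of 4.
    raw = [[float(i), float(i) * 2.0] for i in range(8)]
    compact = compress_embeddings(raw, scaling_factor=4)
    print(len(raw), "->", len(compact))  # 8 -> 2

    # Illustrative numbers only: a 4096-slot window with factor 8.
    print(effective_context_length(4096, 8))  # 32768
```

Because the factor is just a parameter of the compression step in this sketch, it can be changed at inference time without touching the downstream model, which mirrors the flexibility the summary describes.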