XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

Shengnan Wang, Youhui Bai, Lin Zhang, Pingyi Zhou, Shixiong Zhao, Gong Zhang, Sen Wang, Renhai Chen, Hua Xu, Hongwei Sun
2024-05-28
Abstract: The length generalization failure problem, namely that a large language model (LLM) fails to generalize to texts longer than its maximum training length, greatly restricts the application of LLMs in scenarios with streaming long inputs. To address this problem, existing methods either require substantial costs or introduce precision loss. In this paper, we empirically find that the accuracy of the LLM's prediction is highly correlated with its certainty. Based on this, we propose an efficient training-free framework, named XL3M (extra-long large language model), which enables LLMs trained on short sequences to reason over extremely long sequences without any further training or fine-tuning. Under the XL3M framework, the input context is first decomposed into multiple short sub-contexts, where each sub-context contains an independent segment and a common "question", which is a few tokens from the end of the original context. XL3M then measures the relevance between each segment and the "question", and constructs a concise key context by splicing all the relevant segments in chronological order. The key context is used instead of the original context to complete the inference task. Evaluations on comprehensive benchmarks show the superiority of XL3M. Using our framework, a Llama2-7B model is able to reason over 20M-long sequences on an 8-card Huawei Ascend 910B NPU machine with 64GB memory per card.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper "XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference" addresses the length generalization failure problem: a large language model (LLM) fails to generalize to texts longer than its maximum training length. This limitation restricts the use of LLMs in scenarios that require long inputs, such as multi-turn dialogue, dialogue guidance, and document summarization. Existing methods either incur substantial costs, such as continued training or fine-tuning, or lose accuracy.

The researchers find empirically that the accuracy of an LLM's prediction is highly correlated with its certainty. Based on this, they propose XL3M, an efficient training-free framework that enables LLMs trained on short sequences to reason over extremely long sequences. XL3M decomposes the input context into multiple short sub-contexts, each containing an independent segment and a common "question" (a few tokens from the end of the original context). It then measures the relevance between each segment and the "question" and constructs a concise key context by splicing the relevant segments in chronological order. This key context is used instead of the original context for the inference task, which removes irrelevant context and lets the LLM generate high-quality results from the extracted key context.

The main contributions of the paper include:
1. Introducing the XL3M framework, demonstrating the high correlation between the accuracy of an LLM's prediction and its certainty (measured by entropy), and using this principle to achieve length extension without training.
2. Evaluating XL3M on a series of comprehensive benchmarks and on the widely used "needle in a haystack" task, showing superior performance compared with other state-of-the-art methods, both fine-tuning and non-fine-tuning.
3. Leaving the basic structure of the LLM unmodified and requiring no additional training or fine-tuning, while remaining time- and memory-efficient: the framework handles 20M-long sequences on an 8-card Huawei Ascend 910B NPU machine.

The paper also reviews existing length extension techniques, including fine-tuning-based, non-fine-tuning-based, and external-memory-based methods, and analyzes their limitations and effectiveness.
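
The segment-wise procedure summarized above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the source only says that certainty is measured by entropy and that relevant segments are spliced in chronological order, so the selection rule used here (keep the `top_k` segments with the lowest next-token entropy) is an assumption. The names `entropy`, `split_segments`, `build_key_context`, and `toy_probs` are hypothetical, and `toy_probs` merely stands in for a real LLM's next-token distribution.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats); lower entropy means a more certain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_segments(tokens: Sequence[str], segment_len: int) -> List[List[str]]:
    """Cut the context into consecutive, non-overlapping segments."""
    return [list(tokens[i:i + segment_len])
            for i in range(0, len(tokens), segment_len)]


def build_key_context(
    tokens: Sequence[str],
    question_len: int,
    segment_len: int,
    top_k: int,
    next_token_probs: Callable[[List[str]], Sequence[float]],
) -> List[str]:
    """Return the selected segments (in original order) followed by the question."""
    if question_len <= 0 or question_len >= len(tokens):
        raise ValueError("question_len must be in 1..len(tokens)-1")
    # The "question" is the last few tokens of the original context.
    body, question = list(tokens[:-question_len]), list(tokens[-question_len:])
    segments = split_segments(body, segment_len)
    # Score each sub-context (segment + question) by the entropy of the
    # next-token distribution returned by the model stand-in.
    scores = [entropy(next_token_probs(seg + question)) for seg in segments]
    # Assumed selection rule: keep the top_k most certain segments,
    # then restore chronological order before splicing.
    ranked = sorted(range(len(segments)), key=lambda i: scores[i])
    keep = sorted(ranked[:top_k])
    key_context = [tok for i in keep for tok in segments[i]]
    return key_context + question


def toy_probs(sub_context: List[str]) -> List[float]:
    """Hypothetical model stand-in: confident only when 'needle' is present."""
    if "needle" in sub_context:
        return [0.97, 0.01, 0.01, 0.01]
    return [0.25, 0.25, 0.25, 0.25]


if __name__ == "__main__":
    context = ["a", "b", "needle", "c", "d", "e", "f", "g", "where", "?"]
    print(build_key_context(context, question_len=2, segment_len=2,
                            top_k=1, next_token_probs=toy_probs))
    # prints ['needle', 'c', 'where', '?']
```

In a real setting, `next_token_probs` would wrap a forward pass of the short-context LLM over each sub-context. Because every sub-context is scored independently, the segments can be processed in parallel or as a stream.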