BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
2024-06-15
Abstract: In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to evaluate large language models (LLMs) on long contexts. Existing evaluation methods have not kept pace with growing context windows and cannot comprehensively assess how well models use long inputs. The authors therefore introduce BABILong, a benchmark that tests whether a language model can reason across facts distributed through extremely long documents. It comprises 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets, with the required facts scattered across long natural text.
The paper finds that popular LLMs effectively use only 10-20% of their context and that performance declines sharply as reasoning complexity increases. Retrieval-Augmented Generation methods reach about 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, recurrent memory transformers perform best and process inputs of up to 11 million tokens. The benchmark can be extended to any length to evaluate future models, and the authors provide splits of up to 1 million tokens.
Overall, BABILong exposes the limits of current LLMs on long inputs: even state-of-the-art models degrade, especially when key information must be extracted from a large amount of irrelevant detail. The study also suggests that, although Retrieval-Augmented Generation performs poorly on some tasks, task-specific fine-tuning can improve performance. Finally, the authors show that a Recurrent Memory Transformer still answers single-fact questions when the input reaches 11 million tokens, which they present as a record sequence length for a single model.
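To make the "reasoning-in-a-haystack" setup concrete, here is a minimal Python sketch of how a sample of this kind could be assembled: a few task facts are scattered, in their original order, among unrelated background sentences until a target length is reached. This is an illustration under stated assumptions, not the authors' generation code. The function name `build_sample`, its parameters, the word-based length budget (the benchmark itself measures length in tokens), and the filler sentences are all hypothetical.

```python
import random


def build_sample(facts, question, answer, background, target_words, seed=0):
    """Scatter `facts` (order preserved) among background sentences.

    Hypothetical helper: length is counted in whitespace-separated words
    as a stand-in for a real tokenizer.
    """
    rng = random.Random(seed)

    # Reserve room for the facts, then take filler sentences up to the budget.
    budget = target_words - sum(len(f.split()) for f in facts)
    filler, used = [], 0
    for sentence in background:
        n = len(sentence.split())
        if used + n > budget:
            break
        filler.append(sentence)
        used += n

    # One random slot per fact; sorting the slots keeps the facts in order.
    slots = sorted(rng.randint(0, len(filler)) for _ in facts)
    pending = list(zip(slots, facts))

    context = []
    for i in range(len(filler) + 1):
        # Emit every fact assigned to slot i before filler sentence i.
        while pending and pending[0][0] == i:
            context.append(pending.pop(0)[1])
        if i < len(filler):
            context.append(filler[i])

    return {"input": " ".join(context), "question": question, "target": answer}


if __name__ == "__main__":
    facts = ["Mary travelled to the office.", "Mary went to the garden."]
    # Made-up filler; any natural-language corpus could play this role.
    background = [f"Filler sentence number {i} goes here." for i in range(200)]
    sample = build_sample(facts, "Where is Mary?", "garden", background, 300, seed=1)
    print(len(sample["input"].split()), "words;", sample["question"])
```

Preserving fact order matters in this sketch because the correct answer to a question such as "Where is Mary?" depends on which supporting fact comes last. Increasing `target_words` lengthens the haystack without changing the reasoning task itself.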