Abstract: In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.

What problem does this paper attempt to address?

This paper addresses how to evaluate the ability of large language models (LLMs) to handle long contexts. Existing evaluation methods have not kept pace with model capabilities and cannot comprehensively assess how efficiently models use long inputs. The researchers therefore propose a new benchmark, BABILong, which tests whether language models can reason across facts distributed in extremely long documents. The benchmark includes 20 reasoning tasks, such as fact chaining, simple induction, deduction, counting, and handling lists/sets, each made harder by scattering the required facts across long natural text.

The paper finds that popular LLMs effectively use only 10-20% of the context, and that performance declines sharply as reasoning complexity increases. Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, recurrent memory transformers perform best and can process sequences of up to 11 million tokens. The benchmark can be extended to any length to evaluate future models with greater capabilities, and the authors provide splits of up to 1 million tokens.

Through BABILong, the researchers expose the limitations of current LLMs on long texts: even state-of-the-art models degrade, especially on tasks that require extracting key information from a large amount of irrelevant detail. The study also suggests that, although Retrieval-Augmented Generation methods may not perform well on certain tasks, fine-tuning for specific tasks can help improve performance. Finally, the Recurrent Memory Transformer answers single-fact questions even when the input reaches 11 million tokens, setting a record for the sequence length handled by a single model.
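
The idea of scattering task facts through long natural text can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the benchmark's actual code: the function `build_haystack_sample`, the example facts and filler sentences, and the use of whitespace-separated words as a rough stand-in for tokens.

```python
import random


def build_haystack_sample(facts, question, filler_sentences, target_words, seed=0):
    """Scatter `facts` (kept in their original order) among filler sentences.

    Hypothetical helper: length is measured in whitespace-separated words,
    a rough proxy for tokens. `filler_sentences` must be non-empty.
    """
    rng = random.Random(seed)

    # Cycle through the filler sentences until the word budget is reached.
    background = []
    total_words = sum(len(f.split()) for f in facts)
    i = 0
    while total_words < target_words:
        sentence = filler_sentences[i % len(filler_sentences)]
        background.append(sentence)
        total_words += len(sentence.split())
        i += 1

    # Sorted insertion slots keep the facts in their original relative order.
    slots = sorted(rng.randint(0, len(background)) for _ in facts)

    context = []
    pending = iter(zip(slots, facts))
    next_slot, next_fact = next(pending, (None, None))
    for pos in range(len(background) + 1):
        # Emit every fact assigned to this position before the filler sentence.
        while next_slot == pos:
            context.append(next_fact)
            next_slot, next_fact = next(pending, (None, None))
        if pos < len(background):
            context.append(background[pos])

    return {"context": " ".join(context), "question": question}


# Illustrative inputs only; these sentences are made up for the example.
example_facts = [
    "Mary went to the kitchen.",
    "Mary picked up the apple.",
    "Mary moved to the garden.",
]
example_filler = [
    "The weather that year was unusually mild.",
    "Trade along the river grew slowly over the decade.",
]
example = build_haystack_sample(
    example_facts, "Where is the apple?", example_filler, target_words=200
)
print(len(example["context"].split()), "words in context")
```

Raising `target_words` lengthens the distractor text while the facts and the question stay the same, which is the sense in which such a sample can be extended to any length. Keeping the facts in order matters for fact-chaining questions, where the answer depends on the sequence of events.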

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

LongIns: A Challenging Long-context Instruction-based Exam for LLMs

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

LooGLE: Can Long-Context Language Models Understand Long Contexts?

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

Long-context LLMs Struggle with Long In-context Learning

Evaluating Multilingual Long-Context Models for Retrieval and Reasoning

RULER: What's the Real Context Size of Your Long-Context Language Models?

DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

Large Language Models Can Self-Improve in Long-context Reasoning

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding

ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism