XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li

2024-04-08

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens. Meanwhile, building high-quality benchmarks with much longer text lengths and more demanding tasks to provide comprehensive evaluations is of immense practical interest to facilitate long context understanding research of LLMs. However, prior benchmarks create datasets that ostensibly cater to long-text comprehension by expanding the input of traditional tasks, which falls short of exhibiting the unique characteristics of long-text understanding, including long dependency tasks and longer text length compatible with modern LLMs' context window size. In this paper, we introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench, which includes three scenarios: Fiction Reading, Paper Reading, and Law Reading, and four tasks of increasing complexity: Memory Retrieval, Detailed Understanding, Overall Understanding, and Open-ended Generation, covering 27 subtasks in English and Chinese. It has an average length of 100K+ words (English) and 200K+ characters (Chinese). Evaluating six leading LLMs on XL$^2$Bench, we find that their performance significantly lags behind human levels. Moreover, the observed decline in performance across both the original and enhanced datasets underscores the efficacy of our approach to mitigating data contamination.

Computation and Language

What problem does this paper attempt to address?

This paper proposes a new benchmark, XL$^2$Bench, for evaluating how well large language models (LLMs) handle extremely long contexts with long-range dependencies. Because their context windows are limited, current LLMs cannot effectively understand and retain very long inputs. Existing methods attempt to extend the context window or compress the text, but high-quality benchmarks for evaluating the resulting long-text understanding are lacking. According to the paper, existing benchmarks usually just expand the inputs of traditional tasks, so they fail to capture what is distinctive about long-text understanding: long-range dependencies and text lengths that match modern LLMs' context windows.

XL$^2$Bench consists of three scenarios (Fiction Reading, Paper Reading, and Law Reading) and four tasks of increasing complexity: Memory Retrieval, Detailed Understanding, Overall Understanding, and Open-ended Generation. It covers 27 subtasks in English and Chinese, with an average length exceeding 100,000 words (English) and 200,000 characters (Chinese).

Testing six leading LLMs on XL$^2$Bench shows that their performance is far below human levels and declines significantly as text length increases. The paper also proposes three data augmentation strategies to mitigate data contamination in the benchmark: text transformation, text replacement, and text concatenation. Overall, the results show that even state-of-the-art LLMs have room for improvement on XL$^2$Bench, especially in understanding and synthesizing information across long texts, which underscores the need for better methods for long-context understanding.
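The summary names the three augmentation strategies without describing how they work. As a rough illustration only, the toy Python sketch below shows one plausible reading of each. The function names (`transform_text`, `replace_text`, `concatenate_texts`), their parameters, and their behavior are all assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Callable, Dict, List


def transform_text(text: str, rewrite: Callable[[str], str]) -> str:
    """Text transformation (assumed reading): rewrite each sentence with a
    caller-supplied function, e.g. a paraphraser or translator."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(rewrite(s) for s in sentences if s)


def replace_text(text: str, mapping: Dict[str, str]) -> str:
    """Text replacement (assumed reading): swap whole-word key terms, such as
    character names, for substitutes so memorized answers no longer match."""
    if not mapping:
        return text
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def concatenate_texts(texts: List[str], separator: str = "\n\n") -> str:
    """Text concatenation (assumed reading): join several documents into one
    longer input, so relevant evidence sits among unrelated material."""
    return separator.join(t.strip() for t in texts)


if __name__ == "__main__":
    sample = "Alice met Bob. Alice left."
    print(replace_text(sample, {"Alice": "Carol", "Bob": "Dave"}))
    print(transform_text(sample, str.upper))
    print(concatenate_texts([sample, "A second document."]))
```

In a real pipeline the `rewrite` callable would likely be a model or translation service rather than `str.upper`, and the replacement mapping would have to be applied consistently to the questions and reference answers as well as the source text.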

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Long-context LLMs Struggle with Long In-context Learning

LooGLE: Can Long-Context Language Models Understand Long Contexts?

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

LongSafetyBench: Long-Context LLMs Struggle with Safety Issues

MileBench: Benchmarking MLLMs in Long Context

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

RULER: What's the Real Context Size of Your Long-Context Language Models?

BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models

M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models

LongGenBench: Long-context Generation Benchmark

LongIns: A Challenging Long-context Instruction-based Exam for LLMs