XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li
2024-04-08
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens. Meanwhile, building high-quality benchmarks with much longer text lengths and more demanding tasks to provide comprehensive evaluations is of immense practical interest to facilitate long context understanding research of LLMs. However, prior benchmarks create datasets that ostensibly cater to long-text comprehension by expanding the input of traditional tasks, which fails to exhibit the unique characteristics of long-text understanding, including long dependency tasks and text lengths compatible with modern LLMs' context window sizes. In this paper, we introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench, which includes three scenarios: Fiction Reading, Paper Reading, and Law Reading, and four tasks of increasing complexity: Memory Retrieval, Detailed Understanding, Overall Understanding, and Open-ended Generation, covering 27 subtasks in English and Chinese. It has an average length of 100K+ words (English) and 200K+ characters (Chinese). Evaluating six leading LLMs on XL$^2$Bench, we find that their performance significantly lags behind human levels. Moreover, the observed decline in performance across both the original and enhanced datasets underscores the efficacy of our approach to mitigating data contamination.
Computation and Language
What problem does this paper attempt to address?
This paper proposes a new benchmark, XL$^2$Bench, to evaluate how well large language models (LLMs) handle extremely long contexts with long-range dependencies. Current LLMs cannot effectively understand and retain very long inputs because of their fixed-size context windows. Existing methods try to extend the context window or compress the input text, but high-quality benchmarks for measuring the resulting long-text understanding are lacking. The paper argues that existing benchmarks usually just lengthen the inputs of traditional tasks, so they fail to capture what is distinctive about long-text understanding: dependencies that span distant parts of the input, and text lengths that match the context windows of modern LLMs. XL$^2$Bench consists of three scenarios (Fiction Reading, Paper Reading, and Law Reading) and four tasks of increasing complexity (Memory Retrieval, Detailed Understanding, Overall Understanding, and Open-ended Generation). It covers 27 subtasks in English and Chinese, with an average length exceeding 100,000 words (English) and 200,000 characters (Chinese). Tests of six leading LLMs on XL$^2$Bench show that their performance is far below human levels and declines significantly as text length increases. The paper also proposes three data augmentation strategies to mitigate data contamination in the benchmark: text transformation, text replacement, and text concatenation. The experimental results indicate that even state-of-the-art LLMs leave considerable room for improvement on XL$^2$Bench, especially in understanding and synthesizing information spread across long texts, which underscores the need for better methods for long-context understanding.
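To make two of the augmentation strategies named above more concrete, here is a minimal Python sketch of text replacement and text concatenation. It is an illustration under stated assumptions, not the paper's implementation: the function names `replace_entities` and `concatenate_documents`, the whole-word matching rule, and the separator are all hypothetical. Text transformation is left out because the summary does not say what it involves.

```python
import re


def replace_entities(text: str, mapping: dict[str, str]) -> str:
    """Swap each key of `mapping` for its value (hypothetical helper).

    Assumption: replacing key names in a source text makes it harder for a
    model to answer from memorized training data instead of the given input.
    Matching uses word boundaries, which suits English text; Chinese text
    would need a different matching rule.
    """
    if not mapping:
        return text
    # Try longer keys first so a full name wins over its own prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def concatenate_documents(docs: list[str], separator: str = "\n\n") -> str:
    """Join several documents into one longer input (hypothetical helper).

    Empty or whitespace-only documents are skipped.
    """
    return separator.join(d.strip() for d in docs if d.strip())


if __name__ == "__main__":
    original = "Elizabeth met Darcy at the ball."
    augmented = replace_entities(original, {"Elizabeth": "Anna", "Darcy": "Mark"})
    print(augmented)
    print(concatenate_documents([augmented, "A second, unrelated document."]))
```

Both helpers operate on plain strings, so they could be chained: replace entities in each source text first, then concatenate the results into a single longer input.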