LooGLE: Can Long-Context Language Models Understand Long Contexts?

Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang
2024-09-06
Abstract: Large language models (LLMs), despite their impressive performance in various language tasks, are typically limited to processing texts within context-window size. This limitation has spurred significant research efforts to enhance LLMs' long-context understanding with high-quality long-sequence benchmarks. However, prior datasets in this regard suffer from shortcomings, such as short context length compared to the context window of modern LLMs; outdated documents that have data leakage problems; and an emphasis on short dependency tasks rather than long dependency tasks. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark for LLMs' long context understanding. LooGLE features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. Human annotators meticulously crafted more than 1,100 high-quality question-answer pairs to meet the long dependency requirements. These pairs underwent thorough cross-validation, yielding the most precise assessment of LLMs' long dependency capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings: (i) commercial models outperformed open-sourced models; (ii) LLMs excelled in short dependency tasks like short question-answering and cloze tasks but struggled with more intricate long dependency tasks; (iii) in-context learning and chaining thoughts offered only marginal improvements; (iv) retrieval-based techniques demonstrated substantial benefits for short question-answering, while strategies for extending context window length had limited impact on long context understanding. As such, LooGLE not only provides a systematic and comprehensive evaluation schema on long-context LLMs, but also sheds light on future development of enhanced models towards "true long-context understanding".
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the limitations of large language models (LLMs) in handling long texts. Although LLMs perform well on a wide range of language tasks, they are usually constrained by their context window size, which makes it difficult for them to understand and process very long texts. To measure this gap, the paper proposes a new benchmark, LooGLE, designed to evaluate how well LLMs understand long texts. The main contributions of LooGLE are:

1. **Dataset Characteristics**: It contains recent documents (published after 2022), each exceeding 24,000 tokens, and includes more than 6,000 newly generated questions covering multiple domains.
2. **High-Quality Question-Answer Pairs**: Human annotators crafted more than 1,100 high-quality question-answer pairs that require long dependency reasoning, and these pairs were cross-validated.
3. **Comprehensive Evaluation**: Eight state-of-the-art LLMs were evaluated, revealing several key findings:
   - Commercial models outperform open-source models;
   - LLMs perform well on short dependency tasks (such as short question answering and cloze tasks) but poorly on more complex long dependency tasks;
   - In-context learning and chain-of-thought prompting bring only marginal improvements;
   - Retrieval-based techniques bring substantial benefits for short question answering, while methods that extend the context window length by optimizing the transformer architecture or position encoding have limited impact on long-context understanding.
4. **Future Directions**: LooGLE not only provides a systematic and comprehensive evaluation framework but also points to future directions for developing models that achieve "true long-context understanding."

Through this work, the paper hopes to advance research on how LLMs handle long texts and to provide a useful reference for subsequent model development.
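To make the retrieval-based finding more concrete, the following is a minimal sketch of how a long document might be reduced to a few relevant chunks before being passed to a model for short question answering. It is an illustration under stated assumptions, not the retriever used in the paper: the word-overlap score is a simple stand-in for a real retrieval model, and the function names (`split_into_chunks`, `overlap_score`, `retrieve_top_k`) and the chunk size are hypothetical.

```python
import re
from collections import Counter


def split_into_chunks(text, chunk_size=50):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def overlap_score(question, chunk):
    """Count word occurrences shared by question and chunk (case-insensitive).

    Assumption: plain word overlap stands in for a real retrieval score.
    """
    q_counts = Counter(re.findall(r"\w+", question.lower()))
    c_counts = Counter(re.findall(r"\w+", chunk.lower()))
    # Counter intersection keeps the minimum count of each shared word.
    return sum((q_counts & c_counts).values())


def retrieve_top_k(document, question, k=2, chunk_size=50):
    """Return the k chunks with the highest overlap score for the question."""
    chunks = split_into_chunks(document, chunk_size)
    ranked = sorted(
        chunks,
        key=lambda chunk: overlap_score(question, chunk),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    doc = "the cat sat down a dog ran away birds fly very high"
    print(retrieve_top_k(doc, "where did the dog ran", k=1, chunk_size=4))
```

A sketch like this also suggests why retrieval helps short dependency questions more than long dependency ones: when the evidence sits in one chunk, a top-k selection can find it, but when the answer depends on information spread across the whole document, any small set of chunks is likely to miss part of it.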