LooGLE: Can Long-Context Language Models Understand Long Contexts?

Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang

2024-09-06

Abstract: Large language models (LLMs), despite their impressive performance in various language tasks, are typically limited to processing texts within context-window size. This limitation has spurred significant research efforts to enhance LLMs' long-context understanding with high-quality long-sequence benchmarks. However, prior datasets in this regard suffer from shortcomings, such as short context length compared to the context window of modern LLMs; outdated documents that have data leakage problems; and an emphasis on short dependency tasks rather than long dependency tasks. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark for LLMs' long context understanding. LooGLE features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. Human annotators meticulously crafted more than 1,100 high-quality question-answer pairs to meet the long dependency requirements. These pairs underwent thorough cross-validation, yielding the most precise assessment of LLMs' long dependency capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings: (i) commercial models outperformed open-sourced models; (ii) LLMs excelled in short dependency tasks like short question-answering and cloze tasks but struggled with more intricate long dependency tasks; (iii) in-context learning and chaining thoughts offered only marginal improvements; (iv) retrieval-based techniques demonstrated substantial benefits for short question-answering, while strategies for extending context window length had limited impact on long context understanding. As such, LooGLE not only provides a systematic and comprehensive evaluation schema on long-context LLMs, but also sheds light on future development of enhanced models towards "true long-context understanding".

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the limitations of large language models (LLMs) in handling long texts. Although LLMs perform well on a wide range of language tasks, they are usually constrained by their context window size, which makes it difficult for them to understand and process very long texts. To study this problem, the paper proposes a new benchmark, LooGLE, designed to evaluate how well LLMs understand long texts. The main contributions of LooGLE include:

1. **Dataset Characteristics**: It contains relatively new documents (published after 2022), with each document exceeding 24,000 tokens, and includes 6,000 newly generated questions covering multiple domains.
2. **High-Quality Question-Answer Pairs**: Over 1,100 high-quality question-answer pairs were crafted by human annotators and cross-validated to meet the long dependency requirements.
3. **Comprehensive Evaluation**: Eight state-of-the-art LLMs were evaluated, revealing several key findings:
   - Commercial models outperform open-source models;
   - LLMs perform well on short dependency tasks (such as short Q&A and cloze tests) but poorly on complex long dependency tasks;
   - In-context learning and chain-of-thought prompting yield only marginal improvements;
   - Retrieval-based techniques bring substantial benefits for short Q&A tasks, while methods that extend the context window length by optimizing the transformer architecture or position encoding have limited impact on long-context understanding.
4. **Future Directions**: LooGLE not only provides a systematic and comprehensive evaluation framework but also points out future directions for enhancing models to achieve "true long-context understanding."

Through this work, the paper hopes to advance research on LLMs' handling of long texts and to provide a useful reference for subsequent model development.
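
The retrieval-based techniques mentioned in the findings can be illustrated with a small sketch. This is not the paper's implementation: the function names (`tokenize`, `chunk_document`, `overlap_score`, `retrieve`), the word-overlap scoring, and the chunk size are all assumptions chosen for illustration. The idea shown is only the general one: split a long document into chunks and pass the model just the chunks most relevant to the question, instead of the whole document.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a stand-in for a real model tokenizer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_document(text, chunk_size=200):
    """Split a document into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def overlap_score(question, chunk):
    """Count question tokens that also occur in the chunk (with multiplicity)."""
    q = Counter(tokenize(question))
    c = Counter(tokenize(chunk))
    return sum(min(q[t], c[t]) for t in q)


def retrieve(question, document, chunk_size=200, top_k=2):
    """Return the top_k highest-scoring chunks, kept in original document order."""
    chunks = chunk_document(document, chunk_size)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: overlap_score(question, chunks[i]),
        reverse=True,
    )
    keep = sorted(ranked[:top_k])  # restore document order for readability
    return [chunks[i] for i in keep]
```

In a sketch like this, a short-dependency question whose evidence sits in one passage is served well by a single retrieved chunk. A long-dependency question whose evidence is spread across many distant passages may not be, since a small `top_k` can miss part of the evidence.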
