Abstract: Large language models (LLMs), despite their impressive performance on various language tasks, are typically limited to processing text that fits within their context window. This limitation has spurred significant research efforts to enhance LLMs' long-context understanding with high-quality long-sequence benchmarks. However, prior datasets suffer from several shortcomings: context lengths that are short compared to the context windows of modern LLMs; outdated documents that are prone to data leakage; and an emphasis on short dependency tasks rather than long dependency tasks. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark for LLMs' long context understanding. LooGLE features relatively new (post-2022) documents, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. Human annotators meticulously crafted more than 1,100 high-quality question-answer pairs to meet the long dependency requirements. These pairs underwent thorough cross-validation, yielding the most precise assessment of LLMs' long dependency capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings: (i) commercial models outperformed open-source models; (ii) LLMs excelled in short dependency tasks such as short question-answering and cloze tasks but struggled with more intricate long dependency tasks; (iii) in-context learning and chain-of-thought prompting offered only marginal improvements; (iv) retrieval-based techniques demonstrated substantial benefits for short question-answering, while strategies for extending context window length had limited impact on long context understanding. As such, LooGLE not only provides a systematic and comprehensive evaluation schema for long-context LLMs, but also sheds light on the future development of enhanced models towards "true long-context understanding".

Do Long-Range Language Models Actually Use Long-Range Context?

Lost in the Middle: How Language Models Use Long Contexts

What Kinds of Tokens Benefit from Distant Text? An Analysis on Long Context Language Modeling

How to Train Long-Context Language Models (Effectively)

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

Empower Your Model with Longer and Better Context Comprehension

Long Context RAG Performance of Large Language Models

Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval

RULER: What's the Real Context Size of Your Long-Context Language Models?

Equipping Transformer with Random-Access Reading for Long-Context Understanding

How much do contextualized representations encode long-range context?

Retrieval meets Long Context Large Language Models

Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell

Evaluating Multilingual Long-Context Models for Retrieval and Reasoning

Long-Short Range Context Neural Networks for Language Modeling

Long-Context Language Modeling with Parallel Context Encoding

Efficient Long-range Language Modeling with Self-supervised Causal Retrieval

Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models

Why Does the Effective Context Length of LLMs Fall Short?

LooGLE: Can Long-Context Language Models Understand Long Contexts?

Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism