Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen

2024-04-10

Abstract: Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities. These benchmarks support intricate manipulation of the length of test cases, and can easily produce text samples up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address some key issues in the current evaluation of long text processing capabilities. Specifically:

1. **Lack of ultra-long text settings**: Existing long text evaluation benchmarks (such as L-Eval and LongBench) rarely include ultra-long text settings (32,000 tokens or longer), which limits the understanding of model performance under extreme context lengths.
2. **Mixing test samples of different length ranges**: Existing benchmarks include test samples of different lengths, making it difficult to assess the model's capabilities across different length ranges.
3. **Limitations of traditional tasks**: Existing long text evaluation benchmarks mainly focus on traditional tasks such as question answering and summarization, which often do not require a comprehensive understanding of the entire text, thus failing to fully evaluate the model's long text comprehension ability.

To address these issues, the paper introduces Ada-LEval, an adjustable-length benchmark for evaluating the long context understanding capabilities of large language models (LLMs). Ada-LEval includes two challenging sub-tasks: TSort and BestAnswer, which can more reliably assess the long context capabilities of LLMs and support the generation of text samples up to 128,000 tokens.
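To make the idea of a length-adaptable test case concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that a BestAnswer-style case presents one question together with several candidate answers, only one of which is correct, and that the case is lengthened by adding more distractor answers. The names `build_best_answer_case` and `count_tokens` are hypothetical, and the whitespace word count is a crude stand-in for a real tokenizer.

```python
import random


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_best_answer_case(question, best_answer, distractors, token_budget, seed=0):
    """Add distractor answers until the next one would exceed token_budget.

    The budget covers the question and answer texts only, not the "A1:" labels.
    Returns (prompt, label), where label is the 1-based position of best_answer.
    Assumes best_answer does not also appear among the distractors.
    """
    rng = random.Random(seed)
    pool = list(distractors)
    rng.shuffle(pool)

    chosen = [best_answer]
    used = count_tokens(question) + count_tokens(best_answer)
    for candidate in pool:
        cost = count_tokens(candidate)
        if used + cost > token_budget:
            break
        chosen.append(candidate)
        used += cost

    # Shuffle so the correct answer does not always come first.
    rng.shuffle(chosen)
    label = chosen.index(best_answer) + 1

    lines = [f"Question: {question}"]
    lines += [f"A{i}: {ans}" for i, ans in enumerate(chosen, start=1)]
    return "\n".join(lines), label
```

Calling the function with the same question and a larger `token_budget` yields a longer prompt built from the same material. This is the property that lets a benchmark report results separately for each length range instead of mixing lengths together.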

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

LooGLE: Can Long-Context Language Models Understand Long Contexts?

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

RULER: What's the Real Context Size of Your Long-Context Language Models?

Long-context LLMs Struggle with Long In-context Learning

M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

A Controlled Study on Long Context Extension and Generalization in LLMs

LongIns: A Challenging Long-context Instruction-based Exam for LLMs

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models

LongSafetyBench: Long-Context LLMs Struggle with Safety Issues

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens

E$^2$-LLM: Efficient and Extreme Length Extension of Large Language Models