Abstract: State-of-the-art large language models (LLMs) now claim remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and these benchmarks suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval incorporates three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluations. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below 64k. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performance is not necessarily higher than that of LLMs with shorter context lengths. (iii) LLMs' performance can degrade significantly in the presence of confusing information, especially in the "needle in a haystack" pressure test. (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval. All datasets and evaluation code are released at: https://github.com/infinigence/LVEval

What problem does this paper attempt to address?

The paper aims to address several key issues in evaluating the long-text comprehension capabilities of current large language models (LLMs). Specifically:

1. **Insufficient Long Context Benchmarks**: Existing long-context benchmark datasets are generally short (about 5k to 21k on average), so they cannot fully evaluate the ultra-long contexts supported by the latest LLMs (such as 256k or more).
2. **Knowledge Leakage Issue**: The documents used in existing benchmark datasets may overlap with the data used to train some LLMs, allowing the models to answer questions from memory or common sense rather than by truly understanding the long context.
3. **Evaluation Bias**: Current automatic evaluation metrics are easily affected by changes in answer format and by irrelevant vocabulary, leading to inaccurate scoring.

To address these issues, the researchers proposed LV-Eval, a bilingual long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k). The main features of LV-Eval include:

- Controllability: The same set of question-answer pairs is tested at each length level, which makes it possible to compare a model's ability to handle long texts across lengths.
- Introduction of Confusing Facts: Confusing facts, generated by GPT-4 and manually corrected, are inserted into the context to increase the difficulty of the test.
- Replacement of Keywords and Phrases: Keywords and phrases are replaced to reduce the evaluation bias caused by knowledge leakage.
- New Evaluation Metric Based on Keyword Recall: A two-stage evaluation method first calculates the recall of keywords in the answers and then calculates the F1 score, which improves scoring objectivity.

By evaluating the performance of 15 LLMs on LV-Eval, the study found that:

- Recent large-scale open-source models (such as Qwen-2.5-72B and Llama-3.1-70B) perform best at the 16k and 32k length levels.
- The performance of the models declines significantly in the presence of confusing information, especially in the "needle in a haystack" test.
- Knowledge leakage and inaccurate evaluation metrics can lead to evaluation bias, which LV-Eval mitigates.
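
The two-stage, keyword-recall-based metric described above can be illustrated with a minimal Python sketch. This is not the released LV-Eval evaluation code. The function names, the whitespace tokenizer, the `recall_threshold` value of 0.5, and the choice to return a score of zero when keyword recall falls below the threshold are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def keyword_recall(prediction: str, keywords: list[str]) -> float:
    """Stage 1: fraction of annotated answer keywords found in the prediction."""
    if not keywords:
        return 1.0
    pred_tokens = set(tokenize(prediction))
    hits = sum(1 for kw in keywords if kw.lower() in pred_tokens)
    return hits / len(keywords)


def token_f1(prediction: str, reference: str) -> float:
    """Stage 2: token-level F1 between the prediction and the reference answer."""
    pred = tokenize(prediction)
    ref = tokenize(reference)
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def two_stage_score(
    prediction: str,
    reference: str,
    keywords: list[str],
    recall_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> float:
    """Gate the F1 score on keyword recall (assumed gating rule)."""
    if keyword_recall(prediction, keywords) < recall_threshold:
        return 0.0
    return token_f1(prediction, reference)


if __name__ == "__main__":
    print(two_stage_score("the capital is paris", "paris", ["paris"]))
    print(two_stage_score("the capital is london", "paris", ["paris"]))
```

Under these assumptions, a fluent answer that never mentions the annotated keyword scores zero no matter how many filler words it shares with the reference, which is the kind of format-insensitivity the summary attributes to the metric. A real implementation would also need punctuation handling and a tokenizer suited to Chinese text, since the benchmark is bilingual.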

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
