LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

Tao Yuan,Xuefei Ning,Dong Zhou,Zhijie Yang,Shiyao Li,Minghui Zhuang,Zheyue Tan,Zhuyu Yao,Dahua Lin,Boxun Li,Guohao Dai,Shengen Yan,Yu Wang
2024-10-03
Abstract: State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval has incorporated three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluations. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below 64k. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performances may not necessarily be higher than those of LLMs with shorter context lengths. (iii) LLMs' performances can significantly degrade in the presence of confusing information, especially in the pressure test of "needle in a haystack". (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval. All datasets and evaluation codes are released at: https://github.com/infinigence/LVEval.
Computation and Language
What problem does this paper attempt to address?
The paper addresses three problems in evaluating the long-context comprehension of current large language models (LLMs):

1. **Insufficient Long-Context Benchmarks**: Mainstream benchmarks have average context lengths of only about 5k to 21k, which cannot exercise the ultra-long contexts that the latest LLMs claim to support (256k or more).
2. **Knowledge Leakage**: The documents used in existing benchmarks may overlap with the training data of some LLMs, so a model can answer from memory or common sense rather than by actually reading the long context.
3. **Evaluation Bias**: Current automatic metrics are sensitive to changes in answer format and to irrelevant words, which leads to inaccurate scoring.

To address these problems, the researchers propose LV-Eval, a bilingual long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k). Its main features are:

- Controllability: The same set of question-answer pairs is tested at every length level, so a model's scores can be compared directly across context lengths.
- Confusing Facts: Confusing facts generated by GPT-4 and then manually corrected are inserted into the context to make the test harder.
- Keyword and Phrase Replacement: Keywords and phrases in the source documents are replaced to reduce the evaluation bias caused by knowledge leakage.
- Keyword-Recall-Based Metric: Scoring runs in two stages, first computing the recall of answer keywords and then computing the F1 score, to make scoring more objective.

Evaluating 15 LLMs on LV-Eval, the study found that:

- Moonshot-v1 and recent large-scale open-source models (such as Qwen-2.5-72B and Llama-3.1-70B) achieve the highest scores, particularly at lengths below 64k (the 16k and 32k levels).
- Model performance declines significantly in the presence of confusing information, especially in the "needle in a haystack" pressure test.
- Knowledge leakage and inaccurate evaluation metrics bias evaluation results, and LV-Eval alleviates both.
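
The two-stage, keyword-recall-based scoring described above can be illustrated with a minimal Python sketch. This is not the benchmark's released evaluation code: the function names, the whitespace tokenizer, and the `recall_threshold` value are all assumptions made for illustration, and the summary does not say how the recall stage and the F1 stage are combined. The sketch assumes the simplest gating rule: if too few answer keywords appear in the prediction, the score is zero; otherwise the score is a token-level F1.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization; a real metric would
    # also strip punctuation and handle Chinese text differently.
    return text.lower().split()


def keyword_recall(prediction: str, keywords: list[str]) -> float:
    # Fraction of annotated answer keywords that appear in the prediction.
    if not keywords:
        return 1.0
    pred_tokens = set(tokenize(prediction))
    hits = sum(1 for kw in keywords if kw.lower() in pred_tokens)
    return hits / len(keywords)


def f1_score(prediction: str, reference: str) -> float:
    # Standard token-overlap F1 between prediction and reference answer.
    pred = tokenize(prediction)
    ref = tokenize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def two_stage_score(
    prediction: str,
    reference: str,
    keywords: list[str],
    recall_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> float:
    # Stage 1: gate on keyword recall so that filler-word overlap alone
    # cannot earn credit.
    if keyword_recall(prediction, keywords) < recall_threshold:
        return 0.0
    # Stage 2: score the surviving predictions with F1.
    return f1_score(prediction, reference)
```

Under these assumptions, a wrong but verbose prediction such as "the answer is the city of london" against the reference "the city of paris" (keyword "paris") gets a nonzero plain F1 from the shared filler words, while `two_stage_score` returns 0.0 because the keyword is missing. This is the kind of format-driven scoring bias the keyword-recall stage is meant to remove.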