Abstract: Recent advancements in large language models (LLMs) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our benchmark comprises 2,648 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at https://github.com/dmis-lab/ETHIC.

What problem does this paper attempt to address?

This paper addresses a gap in long-context evaluation: existing benchmarks do not test whether large language models (LLMs) actually use all of the context they are given. Methods such as the "needle-in-a-haystack" test do not effectively assess whether a model fully utilizes the context, which raises concerns about the reliability of current evaluation techniques. To address this issue, the authors introduce ETHIC (Evaluating Large Language Models on Long-Context Tasks with High Information Coverage), a benchmark that evaluates whether LLMs can use the entire context through tasks in four domains (books, debates, medicine, and law). ETHIC contains 2,648 test instances covering tasks with high information coverage (IC), and so evaluates long-context ability more comprehensively.

### Main Problem Summary:

1. **Limitations of Existing Evaluation Methods**: Existing evaluation methods (such as the "needle-in-a-haystack" test) fail to evaluate whether LLMs make full use of the context.
2. **Lack of Evaluation Benchmarks with High Information Coverage**: Existing benchmarks often depend on only specific parts of the context and ignore how much of the overall information is used.
3. **Need for New Evaluation Criteria**: Evaluating how well LLMs handle long contexts requires new benchmarks and metrics.

### Solutions:

- **Introducing Information Coverage (IC)**: A new metric, information coverage (IC), is defined to quantify the proportion of the context required to answer a query.
- **Developing the ETHIC Benchmark**: The ETHIC benchmark is created, covering tasks in four domains and ensuring that each task requires the model to make full use of all relevant information.
- **Evaluating the Performance of the Latest LLMs**: Evaluating the latest LLMs that support at least 128K tokens reveals that these models perform poorly on high-IC tasks, highlighting the need for further research.

Through these measures, the ETHIC benchmark provides a more rigorous and comprehensive evaluation framework that can support research and development on long-context processing.
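
The idea behind information coverage can be illustrated with a small sketch. The text above only describes IC as the proportion of the context required to answer a query, so everything else here is an assumption made for illustration: the context is pre-split into chunks, the indices of the needed chunks are already annotated, and whitespace-separated words stand in for tokens. The function name `information_coverage` is hypothetical and is not taken from the ETHIC repository.

```python
from typing import Sequence, Set


def information_coverage(chunks: Sequence[str], needed: Set[int]) -> float:
    """Return the fraction of context words that lie in chunks needed for the answer.

    Assumptions (not from the paper): chunk-level annotations are given in
    `needed`, and whitespace-separated words approximate tokens.
    """
    if not chunks:
        raise ValueError("context must contain at least one chunk")
    out_of_range = [i for i in needed if not 0 <= i < len(chunks)]
    if out_of_range:
        raise IndexError(f"chunk indices out of range: {sorted(out_of_range)}")

    # Word counts per chunk; a real setup would use the model's tokenizer.
    lengths = [len(chunk.split()) for chunk in chunks]
    total = sum(lengths)
    if total == 0:
        raise ValueError("context contains no words")

    # IC = words in the needed chunks / words in the whole context.
    return sum(lengths[i] for i in needed) / total


if __name__ == "__main__":
    context = ["alpha beta gamma delta"] * 10  # ten chunks of four words each

    # Needle-in-a-haystack style query: a single chunk suffices.
    print(information_coverage(context, {3}))

    # High-IC style query: every chunk is needed.
    print(information_coverage(context, set(range(10))))
```

Under these assumptions, a query answerable from one small chunk yields a low score regardless of how long the full input is, while a query that depends on every chunk scores 1.0, which is the distinction the summary draws between existing benchmarks and high-IC tasks.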

ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage
