ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage

Taewhoo Lee, Chanwoong Yoon, Kyochul Jang, Donghyeon Lee, Minju Song, Hyunjae Kim, Jaewoo Kang
2024-10-22
Abstract: Recent advancements in large language models (LLM) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our benchmark comprises 2,648 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at https://github.com/dmis-lab/ETHIC.
Computation and Language
What problem does this paper attempt to address?
This paper addresses a gap in long-context evaluation: existing benchmarks cannot tell whether large language models (LLMs) actually use the full context they are given. Methods such as the "needle-in-a-haystack" test do not effectively assess whether a model utilizes all of the contextual information, which raises concerns about the reliability of current evaluation techniques.

To address this, the authors introduce ETHIC (Evaluating Large Language Models on Long-Context Tasks with High Information Coverage), a benchmark that tests whether LLMs can leverage the entire context. ETHIC contains 2,648 test instances across four long-context tasks with high information coverage (IC), drawn from the domains of books, debates, medicine, and law.

### Main Problem Summary:

1. **Limitations of Existing Evaluation Methods**: Existing methods, such as the "needle-in-a-haystack" test, do not establish whether LLMs make full use of the context.
2. **Lack of Evaluation Benchmarks with High Information Coverage**: Existing benchmarks exhibit low IC. The input context may be long, but only a limited part of it is needed to answer each query.
3. **Need for New Evaluation Criteria**: Evaluating how well LLMs handle long contexts requires new benchmarks and metrics.

### Solutions:

- **Introducing Information Coverage (IC)**: A new metric, information coverage (IC), is defined to quantify the proportion of the input context required to answer a query.
- **Developing the ETHIC Benchmark**: The ETHIC benchmark is built from four high-IC tasks across the four domains, so that each task requires the model to use the entire context.
- **Evaluating the Performance of the Latest LLMs**: Evaluating recent LLMs that support at least 128K tokens reveals significant performance drops on high-IC tasks, highlighting the need for further research.

Together, these measures provide a more rigorous and comprehensive evaluation framework for long-context processing.
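
To make the information coverage idea concrete, the following is a minimal sketch that treats IC as the fraction of context tokens falling inside the spans needed to answer a query. The token-span formulation, the function name `information_coverage`, and the example numbers are all assumptions made for illustration; the paper's exact procedure for computing IC is not described in this summary.

```python
from typing import Iterable, Tuple


def information_coverage(
    needed_spans: Iterable[Tuple[int, int]], context_length: int
) -> float:
    """Return the fraction of context tokens needed to answer a query.

    Hypothetical formulation: each span is a half-open (start, end) token
    range marking text the answer depends on. Overlapping spans are
    counted once, and spans are clipped to the context boundaries.
    """
    if context_length <= 0:
        raise ValueError("context_length must be positive")

    covered = set()
    for start, end in needed_spans:
        # Clip each span to [0, context_length) before counting its tokens.
        covered.update(range(max(0, start), min(end, context_length)))

    return len(covered) / context_length


# Illustrative values only (not taken from the paper).
# Needle-in-a-haystack style query: one 50-token span in a 100,000-token context.
needle_ic = information_coverage([(50_000, 50_050)], 100_000)

# High-IC style query: the answer depends on the whole context.
full_ic = information_coverage([(0, 60_000), (50_000, 100_000)], 100_000)
```

Under these assumptions, `needle_ic` is 0.0005 and `full_ic` is 1.0, which mirrors the contrast the summary draws between low-IC benchmarks and the high-IC tasks that ETHIC targets.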