Abstract: Language models (LMs) have demonstrated an improved capacity to handle long-context information, yet existing long-context benchmarks primarily measure LMs' retrieval abilities with extended inputs, e.g., pinpointing a short phrase from long-form text. Therefore, they may fall short when evaluating models' global context understanding capacity, such as synthesizing and reasoning over content across input to generate the response. In this paper, we study long-context language model (LCLM) evaluation through many-shot in-context learning (ICL). Concretely, we identify the skills each ICL task requires, and examine models' long-context capabilities on them. We first ask: What types of ICL tasks benefit from additional demonstrations, and are these tasks effective at evaluating LCLMs? We find that classification and summarization tasks show notable performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. This suggests the classification tasks predominantly test models' retrieval skills. Next, we ask: To what extent does each task require retrieval skills versus global context understanding from LCLMs? We develop metrics to categorize ICL tasks into two groups: (i) retrieval tasks that require strong retrieval ability to pinpoint relevant examples, and (ii) global context understanding tasks that necessitate a deeper comprehension of the full input. We find that not all datasets can effectively evaluate these long-context capabilities. To address this gap, we introduce a new many-shot ICL benchmark, MANYICLBENCH, designed to characterize LCLMs' retrieval and global context understanding capabilities separately. Benchmarking 11 open-weight LCLMs with MANYICLBENCH, we find that while state-of-the-art models perform well in retrieval tasks up to 64k tokens, many show significant drops in global context tasks at just 16k tokens.

What problem does this paper attempt to address?

The paper addresses a gap in how long-context language models (LCLMs) are evaluated: current evaluation focuses mainly on retrieval ability and lacks an effective assessment of global context understanding. Specifically, existing long-context benchmarks mainly measure a model's ability to retrieve specific information from a large amount of text, such as locating a short phrase in a long document. These tests may not fully assess a model's ability to synthesize and reason over the entire input. The authors therefore propose a new many-shot in-context learning (ICL) benchmark, MANYICLBENCH, which evaluates the retrieval ability and the global context understanding ability of LCLMs separately.

### Main contributions of the paper:

1. **Study which tasks benefit from additional examples**: The authors test experimentally whether different types of ICL tasks (classification, summarization, translation, reasoning, etc.) benefit from more demonstrations, and assess how effective these tasks are for evaluating LCLMs.
2. **Develop methods to characterize the main skill each ICL task evaluates**: The authors distinguish whether a task mainly tests retrieval ability or global context understanding.
3. **Construct a new many-shot ICL benchmark**: MANYICLBENCH evaluates LCLMs on retrieval and global context understanding, and excludes irrelevant datasets previously used for LCLM evaluation.
4. **Comprehensive benchmarking of 11 open-weight LCLMs**: The authors evaluate these models with MANYICLBENCH to provide a comprehensive comparison.

### Main findings:

- **Classification tasks benefit from more examples**: Performance on classification tasks is strongly positively correlated with context length, suggesting that classification tasks mainly test the model's retrieval ability.
- **Summarization tasks gradually benefit from more examples**: Performance on summarization tasks is also positively correlated with context length, but the gains slow down as the number of examples increases.
- **Translation tasks have large performance fluctuations**: Performance varies across languages, and the performance gap is larger for low-resource languages.
- **Mathematical tasks benefit from additional examples, especially for stronger models**: On mathematical reasoning tasks, models with more parameters perform better, and performance continues to improve as the number of examples increases.
- **The trends in scientific and symbolic reasoning tasks are inconsistent**: Performance on these tasks varies from model to model; some models benefit from more examples and others show no obvious change.

### Methods:

- **Retrieval load ratio**: Measures how strongly a task depends on retrieval by removing the most similar and the least similar examples from the context and comparing the effect on performance.
- **Global context index**: Measures how much global context understanding a task requires by comparing the effect on performance of adding duplicate examples versus adding unique examples.

### Conclusion:

MANYICLBENCH provides a framework for evaluating the retrieval and global context understanding abilities of LCLMs separately, which helps assess the overall performance of these models more accurately.
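
The two metrics listed under Methods are described here only in words, so the following Python sketch shows one possible reading of them rather than the paper's actual formulas. The function names, the choice of a quotient for the retrieval load ratio, the choice of a difference for the global context index, and the toy scores in the usage lines are all assumptions made for illustration.

```python
def retrieval_load_ratio(score_without_least_similar: float,
                         score_without_most_similar: float,
                         eps: float = 1e-9) -> float:
    """Hypothetical retrieval load ratio (assumed form, not the paper's).

    Both scores come from the same task and model, after removing either the
    demonstrations least similar to the query or those most similar to it.
    A value well above 1 means that losing the most similar demonstrations
    hurts far more, i.e. the task leans on retrieving relevant examples.
    """
    # eps guards against division by zero when the ablated score is 0.
    return score_without_least_similar / max(score_without_most_similar, eps)


def global_context_index(score_with_unique: float,
                         score_with_duplicates: float) -> float:
    """Hypothetical global context index (assumed form, not the paper's).

    Compares extending the context with new, unique demonstrations against
    extending it to the same length with repeated demonstrations. A clearly
    positive value means the model gains from information spread across the
    whole input rather than from a few repeated examples.
    """
    return score_with_unique - score_with_duplicates


# Toy scores, invented purely to show the call pattern (not paper results).
rlr = retrieval_load_ratio(score_without_least_similar=0.8,
                           score_without_most_similar=0.4)
gci = global_context_index(score_with_unique=0.7,
                           score_with_duplicates=0.5)
print(f"retrieval load ratio: {rlr:.2f}")
print(f"global context index: {gci:.2f}")
```

Under this reading, a task with a high retrieval load ratio and a near-zero global context index would fall into the retrieval group, while the opposite pattern would point to a global context understanding task. Any thresholds for that split would need to come from the paper itself.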

Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation

Revisiting In-Context Learning with Long Context Language Models

Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning

Many-Shot In-Context Learning

Large Language Models Know What Makes Exemplary Contexts

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

LooGLE: Can Long-Context Language Models Understand Long Contexts?

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

In-Context Learning with Long-Context Models: An In-Depth Exploration

Long-context LLMs Struggle with Long In-context Learning

VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

ICLEval: Evaluating In-Context Learning Ability of Large Language Models

Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly

Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations

In-Context Learning for Text Classification with Many Labels

Investigating the Learning Behaviour of In-Context Learning: A Comparison with Supervised Learning