Abstract: As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper evaluates how well large language models (LLMs) handle long contexts, particularly in complex information retrieval and reasoning tasks. Specifically, the paper focuses on the following aspects:

1. **Effective Utilization of Long Contexts**: Despite the rapid development of long-context models in recent years, understanding of how effectively these models use their context has not kept pace. The paper evaluates the performance of 17 leading LLMs across different context lengths through a series of experiments.
2. **Information Tracking Ability**: The paper designs several experimental tasks (single needle, multiple needles, conditional needles, thread tracking, and multi-thread tracking) to test the models' ability to track and retrieve information in long contexts. These tasks require the models to find specific pieces of information within a collection of documents that consists mostly of irrelevant material.
3. **Impact of Context Length**: The study finds that as the context length increases, the performance of many models declines. The paper explores the reasons for this performance drop and proposes an effective context length metric to measure the models' actual performance on different tasks more accurately.
4. **Tokenizer Differences**: The paper also points out that different models' tokenizers produce substantially different token counts for the same text, so a stated context length does not mean the same amount of text for every model. The paper therefore emphasizes the need to account for tokenizer differences when comparing the context capabilities of different models.

### Main Contributions

1. **Introduction of Multi-Step Thread Tracking and Multi-Thread Tracking Tasks**: These tasks require models to follow one or more chains of information through a long context and retrieve the final target values.
2. **Demonstration of the Impact of Increasing Context Length on Simple Needle Retrieval Tasks**: The study shows that increasing context length leads to a decline in retrieval performance, while the impact of retrieving multiple needles simultaneously is relatively small.
3. **Evidence that Many Leading LLMs Exhibit Excellent Thread Safety**: These models' performance is almost unaffected when they follow multiple information chains simultaneously.
4. **Comparison of Tokenizers**: The paper reveals significant differences in token counts among different tokenizers for the same text.
5. **Proposal of a Task-Specific and Configurable Model-Agnostic Effective Context Length Metric**: This metric helps to evaluate the models' actual performance on different tasks more accurately.

Through these studies, the paper provides new insights into the capabilities of long-context LLMs and offers a reference point for future model evaluation and improvement.
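To make the thread tracking task more concrete, the following is a minimal sketch of how a single forward thread could be planted in a haystack and then followed to its final value. It is not the authors' released code: the key-value layout with UUID strings, the function names `build_haystack` and `follow_thread`, and the parameters `num_pairs`, `thread_length`, and `seed` are all assumptions made for illustration.

```python
import random
import uuid


def build_haystack(num_pairs: int, thread_length: int, seed: int = 0):
    """Build a hypothetical haystack of key-value pairs containing one thread.

    A thread is a chain of pairs in which each value is the key of a pair
    that appears later in the haystack; the last value in the chain is the
    answer. Returns (haystack, start_key, final_value).
    """
    if not 1 <= thread_length <= num_pairs:
        raise ValueError("thread_length must be between 1 and num_pairs")
    rng = random.Random(seed)

    def new_id() -> str:
        # Seeded UUID-style string (assumed format, chosen for illustration).
        return str(uuid.UUID(int=rng.getrandbits(128)))

    keys = [new_id() for _ in range(num_pairs)]
    # dict preserves insertion order, so it doubles as the ordered context.
    haystack = {key: new_id() for key in keys}

    # Pick thread positions in increasing order so the thread points forward.
    positions = sorted(rng.sample(range(num_pairs), thread_length))
    for here, nxt in zip(positions, positions[1:]):
        haystack[keys[here]] = keys[nxt]

    start_key = keys[positions[0]]
    final_value = haystack[keys[positions[-1]]]
    return haystack, start_key, final_value


def follow_thread(haystack: dict, start_key: str) -> str:
    """Follow values that are themselves keys until a non-key value is hit."""
    seen = {start_key}
    value = haystack[start_key]
    while value in haystack and value not in seen:
        seen.add(value)
        value = haystack[value]
    return value


if __name__ == "__main__":
    hay, start, answer = build_haystack(num_pairs=1000, thread_length=5, seed=42)
    assert follow_thread(hay, start) == answer
    print(f"{len(hay)} pairs; thread from {start} ends at {answer}")
```

In an evaluation of this kind, the haystack would be serialized into the prompt, the model would be given `start_key`, and its reply would be compared with `final_value`. Under the same assumptions, a multi-thread variant would plant several non-overlapping chains and ask for all final values at once.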

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
