Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

Jonathan Roberts, Kai Han, Samuel Albanie
2024-11-08
Abstract: As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper evaluates how well large language models (LLMs) handle long contexts, particularly in complex information retrieval and reasoning tasks. Specifically, it focuses on the following aspects:

1. **Effective Utilization of Long Contexts**: Despite the rapid development of long-context models in recent years, the understanding of how effectively these models use their contexts has not kept pace. The paper evaluates the performance of 17 leading LLMs across different context lengths through a series of experiments.
2. **Information Tracking Ability**: The paper designs several experimental tasks (single needle, multiple needles, conditional needles, thread tracking, and multi-thread tracking) to test the models' ability to track and retrieve information in long contexts. These tasks require the models to find specific pieces of information within a collection of documents consisting mostly of irrelevant information.
3. **Impact of Context Length**: The study finds that as the context length increases, the performance of many models declines. The paper explores the reasons for this performance drop and proposes an effective context length metric to measure the models' actual performance on different tasks more accurately.
4. **Tokenizer Differences**: The paper also points out that different models' tokenizers produce substantially different token counts for the same text, which changes what a given context length actually means. The paper therefore emphasizes the need to account for tokenizer differences when comparing the context capabilities of different models.

### Main Contributions

1. **Introduction of Multi-Step Thread Tracking and Multi-Thread Tracking Tasks**: These tasks require models to track multiple information chains in long contexts and ultimately retrieve the target values.
2. **Demonstration of the Impact of Increasing Context Length on Simple Needle Retrieval Tasks**: The study shows that increasing context length leads to a decline in retrieval performance, while the impact of retrieving multiple needles simultaneously is relatively small.
3. **Evidence that Many Leading LLMs Exhibit Excellent Thread Safety**: These models' performance is almost unaffected when tracking multiple information chains simultaneously.
4. **Comparison of Tokenizers**: Reveals significant differences in token counts across tokenizers for the same text.
5. **Proposal of a Task-Specific and Configurable Model-Agnostic Effective Context Length Metric**: This metric helps to evaluate the models' actual performance on different tasks more accurately.

Through these studies, the paper provides new insights into the capabilities of long-context LLMs and offers a reference for future model evaluation and improvement.
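
To make the thread tracking task more concrete, the following is a minimal sketch of how such a haystack might be constructed. It is not the authors' released code. The key-value layout, the use of UUID-style strings, and the names `build_haystack` and `follow_thread` are assumptions made for illustration: a "thread" is modeled as a chain of pairs in which each value is the key of the next pair, and the value of the last pair is the target.

```python
import random
import uuid


def build_haystack(num_pairs, thread_length, seed=0):
    """Build a key-value haystack containing one thread of linked pairs.

    Hypothetical helper: returns (haystack, start_key, final_value).
    """
    rng = random.Random(seed)

    def new_id():
        # Seeded 128-bit identifier rendered as a UUID-style string.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    # Distractor pairs: unrelated keys and values (the irrelevant "hay").
    pairs = [(new_id(), new_id()) for _ in range(num_pairs - thread_length)]

    # Thread: nodes[0] -> nodes[1] -> ... -> nodes[thread_length].
    nodes = [new_id() for _ in range(thread_length + 1)]
    pairs.extend(zip(nodes[:-1], nodes[1:]))

    # Shuffle so the thread's links are scattered through the context.
    rng.shuffle(pairs)
    return dict(pairs), nodes[0], nodes[-1]


def follow_thread(haystack, start_key):
    """Reference answer: keep following values that are themselves keys."""
    current = haystack[start_key]
    while current in haystack:
        current = haystack[current]
    return current


haystack, start_key, final_value = build_haystack(num_pairs=1000, thread_length=5)
print(follow_thread(haystack, start_key) == final_value)
```

Under these assumptions, a prompt could serialize `haystack` (for example as JSON), ask the model for the final value of the thread starting at `start_key`, and score the reply by exact match against `final_value`. A multi-thread variant would embed several such chains in the same haystack and ask for all of their final values at once.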