Abstract: The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
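
To make the vanilla NIAH setup described in the abstract concrete, here is a minimal Python sketch of how one test sample could be built and scored. It is an illustration under stated assumptions, not RULER's actual code: the helper names `build_niah_sample` and `is_correct`, the filler sentences, the needle wording, and the substring-match scoring are all hypothetical choices.

```python
import random


def build_niah_sample(num_filler, depth, key="magic-number", value="7421953", seed=0):
    """Build one needle-in-a-haystack prompt (hypothetical helper, not RULER's code).

    num_filler: number of distractor sentences in the haystack.
    depth: relative position of the needle, from 0.0 (start) to 1.0 (end).
    """
    rng = random.Random(seed)
    # Assumed distractor text; any long unrelated text would serve as a haystack.
    filler = ["The grass is green.", "The sky is blue.", "The sun is yellow."]
    haystack = [rng.choice(filler) for _ in range(num_filler)]

    # The needle is a single key-value fact hidden among the distractors.
    needle = f"The special {key} is {value}."
    haystack.insert(int(depth * len(haystack)), needle)

    question = f"What is the special {key} mentioned in the text above?"
    prompt = " ".join(haystack) + "\n\n" + question
    return prompt, value


def is_correct(model_output, answer):
    """Assumed scoring rule: the answer string must appear in the model output."""
    return answer in model_output


prompt, answer = build_niah_sample(num_filler=100, depth=0.5)
print(len(prompt.split()), "words; expected answer:", answer)
```

Increasing `num_filler` lengthens the context, and sweeping `depth` moves the needle through the haystack, which are the two axes a NIAH-style test typically varies.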

What problem does this paper attempt to address?

This paper addresses how to evaluate the real context-processing ability of long-context language models. Existing evaluations rely mainly on simple retrieval tasks such as the needle-in-a-haystack (NIAH) test, which reflects only a model's basic retrieval ability and cannot comprehensively measure its understanding of long contexts. The authors therefore propose a new synthetic benchmark, **RULER**, which evaluates long-context language models more comprehensively through diverse task configurations.

### Specific problems and solutions:

1. **Limitations of existing evaluation methods**:
   - Existing evaluation methods (such as the NIAH test) focus on the model's ability to retrieve specific information from long texts, which is too simple to measure long-context understanding comprehensively.
   - These methods usually test only retrieval and ignore other important behaviors, such as multi-hop tracing and information aggregation.
2. **Proposal of RULER**:
   - **RULER** is a synthetic benchmark with configurable sequence length and task complexity, designed to evaluate the overall ability of long-context language models.
   - RULER contains the following four task categories:
     1. **Retrieval**: Extends the traditional NIAH test with different types of "needles" and "haystacks" to evaluate retrieval ability in more complex situations.
     2. **Multi-hop Tracing**: Introduces a variable-tracking task that simulates coreference chain resolution and tests the model's ability to track entities across long inputs.
     3. **Information Aggregation**: Uses common-word and frequent-word extraction tasks to test the model's ability to aggregate relevant information spread across a long context.
     4. **Question Answering**: Adds distracting information to existing short-context question-answering datasets to evaluate question-answering ability at different context lengths.
3. **Experimental results**:
   - The authors used RULER to evaluate 17 long-context language models, including 15 open-source models and 2 closed-source models.
   - Although these models perform nearly perfectly on the simple NIAH test, their performance on RULER's more complex tasks drops significantly as the context length increases.
   - Only half of the models maintain satisfactory performance at a 32K context length, even though all of them claim context sizes of 32K tokens or greater.
4. **Contributions**:
   - Proposed a new benchmark, RULER, for evaluating long-context language models more comprehensively.
   - Introduced new task categories, namely multi-hop tracing and information aggregation, to test behaviors beyond retrieval from context.
   - Conducted a detailed evaluation and analysis of 17 long-context language models, revealing how their performance differs across tasks and context lengths.

With RULER, the authors hope to encourage more in-depth research on and evaluation of long-context language models, and thereby further progress in this field.
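
The multi-hop tracing and aggregation categories summarized above can be illustrated with a short Python sketch. This is a simplified approximation under assumptions, not the benchmark's implementation: the function names, the `VAR_i` naming scheme, the statement wording, the filler text, and the token-level recall metric are all hypothetical.

```python
import random
import re


def build_variable_chain(num_hops, noise_lines, seed=0):
    """Variable-tracking sample: a chain of assignments hidden in filler text."""
    rng = random.Random(seed)
    names = [f"VAR_{i}" for i in range(num_hops + 1)]
    value = str(rng.randint(10000, 99999))

    # First statement binds a value; each later one points at the previous name.
    statements = [f"VAR {names[0]} = {value}."]
    for prev, cur in zip(names, names[1:]):
        statements.append(f"VAR {cur} = VAR {prev}.")

    # Scatter the chain among filler lines while preserving the chain order.
    total = noise_lines + len(statements)
    chain_slots = set(rng.sample(range(total), len(statements)))
    pending = iter(statements)
    lines = [next(pending) if i in chain_slots else "The grass is green."
             for i in range(total)]

    question = f"Find all variables that are assigned the value {value}."
    return "\n".join(lines) + "\n\n" + question, names


def build_common_words_sample(common, rare, common_count, rare_count, seed=0):
    """Aggregation sample: the answer is the set of most frequent words."""
    rng = random.Random(seed)
    words = [w for w in common for _ in range(common_count)]
    words += [w for w in rare for _ in range(rare_count)]
    rng.shuffle(words)
    question = f"What are the {len(common)} most common words in the list above?"
    return " ".join(words) + "\n\n" + question, sorted(common)


def recall(model_output, answers):
    """Assumed metric: fraction of expected answers present as whole tokens."""
    tokens = set(re.findall(r"\w+", model_output))
    return sum(a in tokens for a in answers) / len(answers)


vt_prompt, vt_answers = build_variable_chain(num_hops=3, noise_lines=20)
print(vt_answers, recall("VAR_0 and VAR_2", vt_answers))
```

Both generators expose knobs (`num_hops`, `noise_lines`, `common_count`, `rare_count`) that raise task complexity or context length independently, which mirrors the idea of flexible configuration described in the summary.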

RULER: What's the Real Context Size of Your Long-Context Language Models?

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

LooGLE: Can Long-Context Language Models Understand Long Contexts?

Long-context LLMs Struggle with Long In-context Learning

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

LongIns: A Challenging Long-context Instruction-based Exam for LLMs

Long Context RAG Performance of Large Language Models

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly

LongGenBench: Long-context Generation Benchmark

Marathon: A Race Through the Realm of Long Context with Large Language Models