LongIns: A Challenging Long-context Instruction-based Exam for LLMs

Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Wenhu Chen, Ge Zhang

2024-06-26

Abstract: The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, so they can only partially represent the reasoning performance of LLMs over large amounts of information. Meanwhile, although LLMs often claim to have context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs, which is built on existing instruction datasets. Specifically, in LongIns, we introduce three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations of existing LLMs and report the following important findings: (1) The top-performing GPT-4 with 128k context length performs poorly at the evaluation context window of 16k in LongIns. (2) The multi-hop reasoning ability of many existing LLMs still needs significant improvement, even under short context windows (less than 4k).

Computation and Language

What problem does this paper attempt to address?

The paper addresses how to evaluate large language models (LLMs) on long-context inputs. Most current benchmarks test the ability to retrieve key information rather than the ability to reason over the whole context, and they do not reveal the context length a model can actually use, as opposed to the length it claims to support. The authors therefore propose LongIns, a benchmark dataset built on existing instruction datasets and designed to evaluate how well LLMs understand long sequences. LongIns includes three evaluation settings: Global Instruction with Single Task (GIST), Local Instruction with Single Task (LIST), and Local Instruction with Multiple Tasks (LIMT). Experiments on various LLMs show that even GPT-4, with its claimed 128k context window, performs poorly at an evaluation context window of 16k, and that many existing LLMs still struggle with multi-hop reasoning under short context windows (less than 4k). Current LLMs therefore still have significant shortcomings on real long-sequence tasks.
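The summary above names the three settings but does not say how their prompts are assembled. The sketch below shows one plausible reading of the names only, and is not the authors' construction: it assumes that a "global" instruction is stated once ahead of all questions, that a "local" instruction is repeated with each question, and that "multiple tasks" means the questions carry different instructions. The function names `build_gist` and `build_local`, the `Q1:`-style numbering, and the sample instructions are all hypothetical.

```python
from typing import List, Tuple

# (instruction, question) pair; this structure is an assumption for illustration
Example = Tuple[str, str]


def build_gist(instruction: str, questions: List[str]) -> str:
    """Assumed GIST layout: one instruction up front, then questions from a single task."""
    body = "\n".join(f"Q{i}: {q}" for i, q in enumerate(questions, 1))
    return f"{instruction}\n{body}"


def build_local(examples: List[Example]) -> str:
    """Assumed LIST/LIMT layout: every question is preceded by its own instruction.

    Passing the same instruction for every question gives a LIST-style prompt;
    mixing instructions from different tasks gives a LIMT-style prompt.
    """
    return "\n".join(
        f"Q{i}: {instruction} {question}"
        for i, (instruction, question) in enumerate(examples, 1)
    )


if __name__ == "__main__":
    questions = ["2 + 2 = 5", "3 + 3 = 6"]
    print(build_gist("Judge whether each equation is correct.", questions))
    print(build_local([("Judge whether the equation is correct.", q) for q in questions]))
    print(build_local([
        ("Judge whether the equation is correct.", "2 + 2 = 5"),
        ("Judge whether the translation is correct.", "chat -> cat"),
    ]))
```

Under these assumptions, the three settings differ only in where the instruction sits and how many task types are mixed, so a longer context is obtained simply by concatenating more questions.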

Long-context LLMs Struggle with Long In-context Learning

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios

LooGLE: Can Long-Context Language Models Understand Long Contexts?

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

RULER: What's the Real Context Size of Your Long-Context Language Models?

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models

A Controlled Study on Long Context Extension and Generalization in LLMs

LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

Make Your LLM Fully Utilize the Context

LongSafetyBench: Long-Context LLMs Struggle with Safety Issues

Hyper-multi-step: The Truth Behind Difficult Long-context Tasks