Abstract: LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific insights repeat across documents. The "Summary of a Haystack" (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects: Coverage and Citation. We generate Haystacks in two domains (conversation, news), and perform a large-scale evaluation of 10 LLMs and corresponding 50 RAG systems. Our findings indicate that SummHay is an open challenge for current systems, as even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay.

What problem does this paper attempt to address?

This paper proposes a task called "Summary of a Haystack" (SummHay) to address the difficulty of evaluating long-context large language models (LLMs) and retrieval-augmented generation (RAG) systems on large volumes of input. Existing evaluations such as the "Needle-in-a-Haystack" task are too simple to measure what the latest models can actually do.

To build the benchmark, the researchers synthesize collections of documents called Haystacks, in which specific insights are deliberately repeated across multiple documents. Given a query, a system must process the Haystack and generate a summary that identifies the relevant insights and accurately cites the documents they come from. Because the insights and their source documents are known in advance, summaries can be scored automatically on two criteria: Coverage (whether the expected reference insights appear in the summary) and Citation (whether the cited documents are the correct ones). The two are combined into a Joint Score.

Using this automatic evaluation, the researchers ran a large-scale study on 10 Haystacks in two domains (conversation and news), covering 10 LLMs and 50 RAG systems. Even when given an oracle signal of document relevance, current systems trail the estimated human performance of 56% by more than 10 points on the Joint Score. Long-context LLMs used without a retriever, such as GPT-4o and Claude 3 Opus, score below 20%. RAG systems generally improve citation quality but may sacrifice insight coverage. The study also reveals position bias in long-context models, which tend to focus on information at the top or bottom of the context window.

The main contributions of the paper are: 1) a procedure for generating Haystacks; 2) an evaluation protocol for SummHay; 3) a large-scale evaluation of 50 RAG systems and 10 long-context LLMs, showing that SummHay remains a challenging open problem. The researchers hope that future systems can match and surpass human performance on SummHay, leading to more reliable answer engines.
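The Coverage, Citation, and Joint scoring described above can be illustrated with a small sketch. The text here only names the three scores, so the concrete formulas below are assumptions made for illustration: coverage is graded per insight as 1.0 (full), 0.5 (partial), or 0.0 (missing); citation is the F1 between cited and gold document IDs, averaged over covered insights; and the joint score is the per-insight product of the two, averaged over all reference insights. The names `InsightResult`, `citation_f1`, and `score_summary` are hypothetical and do not come from the paper's code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InsightResult:
    """Judgement for one reference insight (hypothetical structure)."""
    covered: float   # assumed scale: 1.0 full, 0.5 partial, 0.0 not covered
    cited: set[str]  # document IDs the summary cites for this insight
    gold: set[str]   # document IDs that actually contain the insight


def citation_f1(cited: set[str], gold: set[str]) -> float:
    """F1 between cited and gold document IDs (assumed citation metric)."""
    true_positives = len(cited & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cited)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_summary(results: list[InsightResult]) -> dict[str, float]:
    """Aggregate per-insight judgements into coverage, citation, and joint scores."""
    if not results:
        raise ValueError("at least one reference insight is required")
    n = len(results)
    coverage = sum(r.covered for r in results) / n
    # Citation is only meaningful for insights the summary actually covers.
    covered = [r for r in results if r.covered > 0]
    citation = (
        sum(citation_f1(r.cited, r.gold) for r in covered) / len(covered)
        if covered
        else 0.0
    )
    # Joint: an insight earns credit only if it is both covered and well cited.
    joint = sum(r.covered * citation_f1(r.cited, r.gold) for r in results) / n
    return {"coverage": coverage, "citation": citation, "joint": joint}


if __name__ == "__main__":
    example = [
        InsightResult(1.0, {"d1", "d2"}, {"d1", "d2"}),  # covered, perfect citations
        InsightResult(0.5, {"d3"}, {"d3", "d4"}),        # partial, one source missed
        InsightResult(0.0, set(), {"d5"}),               # insight missing entirely
    ]
    print(score_summary(example))
```

Under these assumptions, a system can reach high coverage and still get a low joint score if its citations are wrong, which is the property that makes the combined number stricter than either component alone.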

Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
