Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel

2024-06-01

Abstract: We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an "Evaluator". We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a more consistent and differentiating scoring mechanism than an absolute scorer that rates answers individually. We also show that LLMs from different model families produce moderate agreement in their ratings. We ground our approach using the manually curated NarrativeQA dataset, where our evaluator shows excellent agreement with human judgement and even finds errors in the dataset. Using our automatic evaluation approach, we show that using an entire book as context produces superior reading comprehension performance compared to baseline no-context (parametric knowledge only) and retrieval-based approaches.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper uses large language models (LLMs) with long-context capabilities to create synthetic reading comprehension datasets from entire books. Specifically, its goals are:

1. **Automatic Generation of Complex Questions and Answers**:
   - Use LLMs with long-context capabilities to automatically generate high-quality question-answer pairs that require deep understanding of, and reasoning over, long spans of text.
   - The generated questions should test not only factual accuracy but also the ability to synthesize information, for example about character arcs, broader themes, or the later consequences of early actions.

2. **Automatic Evaluation of Model Performance**:
   - Propose a relative scoring method that ranks models by comparing their answers to the same question in a pairwise fashion.
   - Rank the models with a Bradley-Terry model, which gives a more consistent and discriminative evaluation than an absolute scorer that rates each answer individually.

3. **Validation of the Method's Effectiveness**:
   - Ground the approach on the manually curated NarrativeQA dataset, where the automatic evaluator shows excellent agreement with human judgement.
   - Identify some errors in that dataset, and show that using the entire book as context produces better reading comprehension performance than no-context (parametric knowledge only) and retrieval-based baselines.

In summary, the main goal of the paper is to explore how long-context LLMs can automatically generate high-quality reading comprehension data, and to propose a method for automatically evaluating and ranking QA systems on that data. This removes the need for crowd-sourcing in dataset construction and yields a more consistent evaluation.
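The pairwise ranking step can be illustrated with a small sketch. The Bradley-Terry model assigns each system a positive strength and models the probability that system i beats system j as strength_i / (strength_i + strength_j). The code below fits those strengths with the standard minorization-maximization (MM) iteration. This is an assumption for illustration: the summary does not say how the paper fits the model. The function name `fit_bradley_terry`, the system labels, and the win counts are all hypothetical and are not results from the paper.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iters=200, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard MM update: strength_i = wins_i / sum_j n_ij / (s_i + s_j),
    followed by normalization so the strengths sum to 1.
    Note: a system with zero wins collapses to strength 0 in this plain version.
    """
    wins = defaultdict(float)         # total wins per system
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))
    systems = sorted(systems)

    strength = {s: 1.0 / len(systems) for s in systems}
    for _ in range(n_iters):
        updated = {}
        for i in systems:
            denom = 0.0
            for j in systems:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        updated = {s: v / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strength[s]) for s in systems)
        strength = updated
        if delta < tol:
            break
    return strength


# Made-up side-by-side outcomes for three hypothetical QA setups.
comparisons = (
    [("full_book", "retrieval")] * 7 + [("retrieval", "full_book")] * 3
    + [("full_book", "no_context")] * 9 + [("no_context", "full_book")] * 1
    + [("retrieval", "no_context")] * 6 + [("no_context", "retrieval")] * 4
)
strengths = fit_bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

In practice each `(winner, loser)` pair would come from an evaluator judging two systems' answers to the same question. One design consequence of ranking on pairwise outcomes is that the fitted strengths depend only on relative preferences, so they are unaffected by how an absolute scorer would calibrate its rating scale.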

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

Leveraging Large Language Models for Multiple Choice Question Answering

Enhanced Story Comprehension for Large Language Models through Dynamic Document-Based Knowledge Graphs

QuALITY: Question Answering with Long Input Texts, Yes!

Examining Long-Context Large Language Models for Environmental Review Document Comprehension

DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

Towards Automatic Generation of Questions from Long Answers

ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering

Investigating Answerability of LLMs for Long-Form Question Answering

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding

Writing your own book: A method for going from closed to open book QA to improve robustness and performance of smaller LLMs

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

LooGLE: Can Long-Context Language Models Understand Long Contexts?

FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models

GenSco: Can Question Decomposition based Passage Alignment improve Question Answering?

Comparison of Large Language Models for Generating Contextually Relevant Questions

L-Eval: Instituting Standardized Evaluation for Long Context Language Models