Abstract: To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Our baseline models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).

What problem does this paper attempt to address?

This paper addresses the limitations of existing natural language understanding models when dealing with long documents. Most current state-of-the-art natural language understanding models can only process texts of a few hundred words, which restricts their performance on tasks that require an overall understanding of an entire passage. To address this issue, the authors introduce a new multiple-choice question-answering dataset, **QuALITY**, which contains English context passages with an average length of about 5,000 tokens, far exceeding the length that existing models can handle. Specifically, the main contributions and goals of the paper include:

1. **Constructing a dataset for long-document understanding**: The context passages in the QuALITY dataset have an average length of about 5,000 tokens, much longer than those in most existing datasets. This enables researchers to test and develop models that can reason over and understand long documents.
2. **Ensuring the quality and challenge of questions**: Unlike previous datasets that rely on summaries or excerpts to generate questions, the questions in QuALITY are written and validated by contributors who have read the full passages. Moreover, only about half of the questions can be answered by annotators working under tight time constraints, meaning that skimming or simple search is not sufficient to answer these questions correctly.
3. **Evaluating the performance of existing models**: The authors test several existing deep-learning models (such as Longformer, RoBERTa, DeBERTaV3, and T5) on the QuALITY dataset. The results show that even the best-performing model (DeBERTaV3-large combined with DPR extraction) reaches an accuracy of only 55.4%, far below human accuracy (93.5%). This indicates that existing models still have significant deficiencies when dealing with long documents.
Through these efforts, the authors hope to encourage the research community to develop more powerful models for tasks that require a comprehensive understanding of long documents, such as news understanding, summarization, or applied question-answering systems.

### Formula Explanation

The article does not involve specific mathematical formulas, but it mentions some technical details, such as:

- Using ROUGE-1 recall, fastText, and DPR models for sentence selection.
- The Longformer model supports an input of up to 4,096 tokens, while the LED model supports an input of up to 16,384 tokens.

These technical details can help in understanding how models process information in long documents.
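
To make the ROUGE-1 recall selection step more concrete, here is a minimal sketch of how sentences from a long passage might be ranked against a question and packed into a fixed token budget. This is an illustration, not the paper's actual pipeline: the regex tokenizer, the function names `rouge1_recall` and `select_sentences`, and the word-count budget are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric word tokens, a stand-in for a real tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(reference_tokens, candidate_tokens):
    """Fraction of reference unigrams (counted with multiplicity) found in the candidate."""
    if not reference_tokens:
        return 0.0
    ref = Counter(reference_tokens)
    cand = Counter(candidate_tokens)
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / len(reference_tokens)


def select_sentences(question, sentences, budget):
    """Greedily keep the highest-scoring sentences that fit within `budget` tokens.

    The question is treated as the ROUGE reference, and the selected
    sentences are returned in their original passage order.
    """
    q_tokens = tokenize(question)
    scored = [(rouge1_recall(q_tokens, tokenize(s)), i) for i, s in enumerate(sentences)]
    # Highest score first; ties are broken by position in the passage.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    chosen, used = [], 0
    for _, i in scored:
        length = len(tokenize(sentences[i]))
        if used + length > budget:
            continue  # skip sentences that would overflow the budget
        chosen.append(i)
        used += length
    return [sentences[i] for i in sorted(chosen)]
```

In a setup like this, the budget would be chosen to match the downstream model's input limit, so that a short-context model sees only the sentences that overlap most with the question.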

QuALITY: Question Answering with Long Input Texts, Yes!

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers

JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension

Analysis of QA System Behavior against Context and Question Changes

Long-Tailed Question Answering in an Open World

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

Towards Automatic Generation of Questions from Long Answers

FinTextQA: A Dataset for Long-form Financial Question Answering

WebCPM: Interactive Web Search for Chinese Long-form Question Answering

DocFinQA: A Long-Context Financial Reasoning Dataset

Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training

SEMQA: Semi-Extractive Multi-Source Question Answering

ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

StoryQA: Story Grounded Question Answering Dataset

Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot Document-Level Question Answering

LONGAGENT: Achieving Question Answering for 128K-Token-long Documents Through Multi-Agent Collaboration

PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

AttenWalker: Unsupervised Long-Document Question Answering via Attention-based Graph Walking