QuALITY: Question Answering with Long Input Texts, Yes!

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman
DOI: https://doi.org/10.48550/arXiv.2112.08608
2022-05-11
Abstract: To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Our baseline models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).
Computation and Language
What problem does this paper attempt to address?
This paper addresses the limitations of existing natural language understanding models on long documents. Most current state-of-the-art models can process only a few hundred words of text, which restricts their performance on tasks that require an overall understanding of an entire passage. To address this, the authors introduce **QuALITY**, a new multiple-choice question-answering dataset of English context passages with an average length of about 5,000 tokens, far beyond what existing models can handle. The main contributions and goals of the paper are:

1. **Constructing a dataset for long-document understanding**: The context passages in QuALITY have an average length of about 5,000 tokens, much longer than those in most existing datasets. This lets researchers develop and test models that reason over long documents.
2. **Ensuring the quality and difficulty of the questions**: Unlike previous datasets whose questions were written from summaries or excerpts, the questions in QuALITY are written and validated by contributors who have read the full passage. Moreover, only about half of the questions can be answered by annotators working under tight time constraints, so skimming or simple search is not enough to answer them consistently.
3. **Evaluating the performance of existing models**: The authors test several existing deep-learning models (such as Longformer, RoBERTa, DeBERTaV3, and T5) on QuALITY. Even the best-performing model (DeBERTaV3-large combined with DPR extraction) reaches an accuracy of only 55.4%, far below human accuracy (93.5%). This indicates that existing models still have significant deficiencies when dealing with long documents.
Through these efforts, the authors hope to encourage the research community to develop more powerful models for tasks that require a comprehensive understanding of long documents, such as news understanding, summarization, or applied question-answering systems.

### Formula Explanation

The article does not involve specific mathematical formulas, but it mentions some technical details, such as:

- Using ROUGE-1 recall, fastText, and DPR models for sentence selection.
- The Longformer model supports an input of up to 4,096 tokens, while the LED model supports an input of up to 16,384 tokens.

These technical details can help in understanding how models process information in long documents.
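To make the sentence-selection idea concrete, here is a minimal sketch of ROUGE-1 recall scoring used to shrink a long passage to a fixed budget. It is an illustration under stated assumptions, not the paper's implementation: the query is assumed to be the question text, tokens are counted as lowercase word matches rather than model subword tokens, and the names `tokenize`, `rouge1_recall`, and `select_sentences` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; a real system would use the model's tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(query, sentence):
    """Fraction of the query's unigrams (counted with multiplicity) that also occur in the sentence."""
    query_counts = Counter(tokenize(query))
    sentence_counts = Counter(tokenize(sentence))
    total = sum(query_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, sentence_counts[tok]) for tok, count in query_counts.items())
    return overlap / total


def select_sentences(query, sentences, max_tokens):
    """Greedily keep the highest-scoring sentences that fit in max_tokens,
    then return them in their original passage order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: rouge1_recall(query, sentences[i]),
        reverse=True,
    )
    chosen, used = [], 0
    for i in ranked:
        length = len(tokenize(sentences[i]))
        if used + length > max_tokens:
            continue  # skip sentences that would exceed the budget
        chosen.append(i)
        used += length
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    passage = [
        "The ship left Mars at dawn.",
        "Nobody noticed the stowaway.",
        "The captain hid the map on the ship.",
    ]
    question = "Where did the captain hide the map?"
    print(select_sentences(question, passage, max_tokens=14))
```

In this sketch, `max_tokens` stands in for the input limit of the downstream model; restoring the original sentence order is a design choice that keeps the extracted context readable as a shortened passage.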