Abstract: Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the inability of current reading comprehension systems to handle complex reasoning tasks. Although existing systems have reached human-level performance on some popular datasets, they remain brittle and cannot answer questions that require discrete reasoning (such as addition, sorting, or counting). The paper therefore introduces a new English reading comprehension benchmark, DROP (Discrete Reasoning Over Paragraphs), to encourage a more comprehensive analysis of paragraph content.

### Main Contributions

1. **New Dataset**: DROP is a crowdsourced, adversarially created dataset of 96,567 questions that require systems to perform discrete reasoning over the content of paragraphs.
2. **Complex Question Types**: The questions in DROP require systems not only to locate relevant information in a paragraph but also to combine language understanding with numerical reasoning.
3. **Baseline Model Evaluation**: The paper evaluates a range of existing methods on DROP and shows that the best of them reaches an F1 score of only 32.7%, while expert human performance is 96.4%.
4. **New Model**: The paper proposes a new model, NAQANet, which combines standard reading comprehension methods with simple numerical reasoning. It achieves an F1 score of 47.0% on DROP, a gain of 14.3 percentage points over the best baseline system.

### Dataset Characteristics

- **Discrete Reasoning**: Questions require discrete operations such as addition, sorting, or counting.
- **Multi-step Reasoning**: Many questions require finding descriptions of multiple events and then aggregating over them.
- **Entity Coreference**: Many questions require resolving coreference between entity mentions.
- **Open Domain**: The paragraphs come from multiple domains, including sports summaries and historical articles.

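
The discrete operations named above (counting, addition, sorting) can be illustrated with a small Python sketch. The paragraph, the regular-expression number extractor, and the function names are all assumptions made for illustration; none of them come from the DROP dataset or the paper. The sketch also skips the hard part that DROP actually tests, which is deciding which numbers in the paragraph a question refers to.

```python
import re

# Hypothetical paragraph written for this example (not taken from DROP).
PARAGRAPH = (
    "In the first quarter the home team kicked a 20-yard field goal. "
    "They added a 35-yard field goal before halftime and a 48-yard "
    "field goal in the fourth quarter."
)


def extract_numbers(text: str) -> list[int]:
    """Return every run of digits in the text, in order of appearance."""
    return [int(token) for token in re.findall(r"\d+", text)]


def answer_examples(text: str) -> dict:
    """Apply a few discrete operations to the numbers found in the text."""
    yards = extract_numbers(text)
    return {
        # "How many field goals were kicked?" -> counting
        "count": len(yards),
        # "How many total yards of field goals were kicked?" -> addition
        "total": sum(yards),
        # "How long was the longest field goal?" -> sorting / max
        "longest": max(yards),
        # "How many yards longer was the longest than the shortest?" -> subtraction
        "gap": max(yards) - min(yards),
        # Field goals ordered from longest to shortest -> sorting
        "sorted_desc": sorted(yards, reverse=True),
    }


if __name__ == "__main__":
    for name, value in answer_examples(PARAGRAPH).items():
        print(f"{name}: {value}")
```

Each question in the comments has a short answer, but producing it requires gathering several values from different sentences first. That aggregation step is what separates these questions from single-span extraction.
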
### Methods

1. **Data Collection**:
   - **Paragraph Extraction**: Paragraphs with narrative sequences and a high proportion of numbers are extracted from Wikipedia.
   - **Question Generation**: Questions are written by crowd workers, who are encouraged to pose questions requiring discrete reasoning; an adversarial baseline system is run during collection to ensure the questions are difficult.
   - **Validation**: The development and test sets are validated to ensure annotation quality.
2. **Baseline Models**:
   - **Heuristic Baselines**: Used to check for biases in the data.
   - **SQuAD-style Reading Comprehension Models**: Such as BiDAF and QANet.
   - **Semantic Parsers**: Pipeline-based semantic parsers.
3. **New Model NAQANet**:
   - **Architecture**: Based on QANet, extended with the ability to perform simple numerical reasoning.
   - **Output Layer**: Four output heads, one per answer type: a span in the paragraph, a span in the question, a count, and an arithmetic expression (addition and subtraction over numbers in the paragraph).

### Conclusion

By introducing the DROP dataset and the NAQANet model, the paper demonstrates the limitations of current reading comprehension systems on complex reasoning tasks and proposes an initial solution. Future research can explore how to combine neural methods with symbolic reasoning to strengthen the reasoning capabilities of such systems.
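
To make the addition and subtraction output listed under Methods more concrete, here is a minimal Python sketch of one way such an output can be represented: every number in the paragraph is given a coefficient of +1, -1, or 0, and the answer is the signed sum. The summary above does not describe the mechanism in this detail, so the representation is an assumption, as are the function names and the example numbers. A trained model would predict the coefficients; the brute-force search below only stands in for that prediction.

```python
from itertools import product
from typing import Sequence


def signed_sum(numbers: Sequence[int], signs: Sequence[int]) -> int:
    """Combine numbers using one coefficient (+1, -1, or 0) per number."""
    if len(numbers) != len(signs):
        raise ValueError("numbers and signs must have the same length")
    return sum(sign * number for number, sign in zip(numbers, signs))


def find_sign_assignments(numbers: Sequence[int], target: int) -> list[tuple[int, ...]]:
    """Enumerate every non-trivial sign assignment whose signed sum equals target.

    Hypothetical helper for illustration only: it tries all 3**n assignments,
    which is feasible only for a handful of numbers.
    """
    matches = []
    for signs in product((-1, 0, 1), repeat=len(numbers)):
        if any(signs) and signed_sum(numbers, signs) == target:
            matches.append(signs)
    return matches


if __name__ == "__main__":
    # Hypothetical numbers from a paragraph: two years and an unrelated count.
    numbers = [1918, 1914, 7]
    # "How many years passed between the two dates?" -> 1918 - 1914
    print(signed_sum(numbers, (1, -1, 0)))
    print(find_sign_assignments(numbers, 4))
```

The zero coefficient matters as much as the signs: most numbers in a paragraph are irrelevant to a given question, so the representation has to be able to leave them out.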

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Reasoning Over Paragraph Effects in Situations

How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks

Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

R3: A Reading Comprehension Benchmark Requiring Reasoning Processes

ReCO: A Large Scale Chinese Reading Comprehension Dataset on Opinion

Question Difficulty Ranking for Multiple-Choice Reading Comprehension

RACE: Large-scale ReAding Comprehension Dataset From Examinations

DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension

Improving Reading Comprehension Question Generation with Data Augmentation and Overgenerate-and-rank

KoRC: Knowledge oriented Reading Comprehension Benchmark for Deep Text Understanding

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension

READoc: A Unified Benchmark for Realistic Document Structured Extraction

Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with a Focus on Candidate Response Distribution

Generating Distractors for Reading Comprehension Questions from Real Examinations

Embracing data abundance: BookTest Dataset for Reading Comprehension

R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason

LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning

CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation

DT-QDC: A Dataset for Question Comprehension in Online Test.