DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner
DOI: https://doi.org/10.48550/arXiv.1903.00161
2019-04-17
Abstract: Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the inadequacy of current reading comprehension systems on complex reasoning tasks. Although existing systems have reached human-level performance on some popular datasets, they remain brittle and cannot handle questions that require discrete reasoning (such as addition, sorting, or counting). The paper therefore introduces a new English reading comprehension benchmark, DROP (Discrete Reasoning Over Paragraphs), to push systems toward a more comprehensive analysis of paragraph content.

### Main Contributions

1. **New Dataset**: DROP is a crowdsourced, adversarially created dataset containing 96,567 questions that require systems to perform discrete reasoning operations over paragraphs.
2. **Complex Question Types**: The questions in DROP require systems not only to find relevant information in the paragraphs but also to perform complex language understanding and numerical reasoning.
3. **Baseline Model Evaluation**: The paper evaluates various existing methods on DROP, showing that the state-of-the-art systems achieve an F1 score of only 32.7%, while expert human performance is 96.4%.
4. **New Model**: The paper proposes a new model, NAQANet, which combines standard reading comprehension methods with simple numerical reasoning. This model achieves an F1 score of 47.0% on DROP, an improvement of 14.3 percentage points over the best baseline system.

### Dataset Characteristics

- **Discrete Reasoning**: The questions in DROP require systems to perform discrete reasoning operations, such as addition, sorting, or counting.
- **Multi-step Reasoning**: Many questions require systems to find descriptions of multiple events and then aggregate over them.
- **Entity Coreference**: Many questions require resolving entity coreference.
- **Open Domain**: The dataset covers paragraphs from multiple domains, including sports summaries and historical articles.
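To make the discrete operations listed above concrete, the following minimal sketch applies counting, addition, sorting, and subtraction to numbers pulled from a short passage. The passage, the example questions in the comments, and the helper `field_goal_yards` are all hypothetical and written for illustration; they are not taken from DROP, and the regular expression stands in for the reference resolution that a real system would have to learn.

```python
import re
from typing import List

# Hypothetical passage, written for illustration only (not a DROP example).
PASSAGE = (
    "The Bears scored first with a 22-yard field goal. "
    "The Lions answered with a 45-yard field goal, "
    "and the Bears closed the half with a 31-yard field goal."
)


def field_goal_yards(passage: str) -> List[int]:
    """Return the yardage of every field goal mentioned, in passage order."""
    return [int(m) for m in re.findall(r"(\d+)-yard field goal", passage)]


yards = field_goal_yards(PASSAGE)

# Counting: "How many field goals were kicked?"
count = len(yards)

# Addition: "How many total yards of field goals were kicked?"
total = sum(yards)

# Sorting: "How long was the longest field goal?"
longest = sorted(yards)[-1]

# Subtraction: "How many yards longer was the longest field goal than the shortest?"
difference = max(yards) - min(yards)

print(count, total, longest, difference)
```

Each answer depends on locating several mentions in the passage and then combining them, which is why selecting a single span of text is not enough for questions of this kind.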
### Methods

1. **Data Collection**:
   - **Paragraph Extraction**: Extracting paragraphs with narrative sequences and a high proportion of numbers from Wikipedia.
   - **Question Generation**: Generating questions through crowdsourcing, encouraging workers to pose questions that require discrete reasoning, and using adversarial baseline systems to ensure question difficulty.
   - **Validation**: Validating the development and test sets to ensure annotation quality.
2. **Baseline Models**:
   - **Heuristic Baselines**: Checking for biases in the data.
   - **SQuAD-style Reading Comprehension Models**: Such as BiDAF and QANet.
   - **Semantic Parsers**: Pipeline-based semantic parsers.
3. **New Model NAQANet**:
   - **Architecture**: Based on QANet, with added capabilities for handling numerical reasoning.
   - **Output Layer**: Includes four output types: a span from the paragraph, a span from the question, a count, and an arithmetic expression (addition and subtraction over numbers in the paragraph).

### Conclusion

By introducing the DROP dataset and the NAQANet model, the paper demonstrates the limitations of current reading comprehension systems on complex reasoning tasks and proposes an initial solution. Future research can further explore how to combine neural methods and symbolic reasoning to strengthen the reasoning capabilities of such systems.
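The addition and subtraction output described under Methods can be illustrated with a small sketch. It assumes one common formulation: every number in the paragraph is assigned a sign of plus, minus, or zero, and the answer is the signed sum. The passage, the hard-coded scores, and the function names `extract_numbers` and `signed_sum` are hypothetical; in a trained model the scores would come from a learned classifier, not from a hand-written table.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical passage, written for illustration only (not a DROP example).
PASSAGE = "The team finished with 45 wins and 37 losses in an 82 game season."


def extract_numbers(passage: str) -> List[int]:
    """Return every integer in the passage, in order of appearance."""
    return [int(tok) for tok in re.findall(r"\d+", passage)]


def signed_sum(
    numbers: List[int], sign_scores: List[Dict[int, float]]
) -> Tuple[List[int], int]:
    """Pick the best-scoring sign (+1, -1, or 0) per number and sum the result.

    sign_scores[i] maps each candidate sign to a score for numbers[i].
    """
    signs = [max(scores, key=scores.get) for scores in sign_scores]
    return signs, sum(s * n for s, n in zip(signs, numbers))


numbers = extract_numbers(PASSAGE)

# Hypothetical scores for the question "How many more wins than losses?".
# A trained model would produce these; they are hard-coded here.
scores = [
    {1: 0.9, -1: 0.05, 0: 0.05},   # 45 -> plus
    {1: 0.1, -1: 0.8, 0: 0.1},     # 37 -> minus
    {1: 0.05, -1: 0.05, 0: 0.9},   # 82 -> ignored
]

signs, answer = signed_sum(numbers, scores)
print(signs, answer)
```

Framing arithmetic as a per-number sign choice keeps the output space small, at the cost of covering only sums and differences of numbers that appear in the text.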