MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics

Anthony Chen, Gabriel Stanovsky, Sameer Singh, Matt Gardner
DOI: https://doi.org/10.18653/v1/2020.emnlp-main.528
2020-10-16
Abstract: Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations. When we evaluate robustness on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement. MOCHA presents a challenging problem for developing accurate and robust generative reading comprehension metrics.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the shortcomings of existing evaluation metrics for generative reading comprehension. Specifically:

1. **Limitations of existing metrics**: Current text generation metrics (such as BLEU, ROUGE, and METEOR) rely mainly on n-gram overlap to score generated answers and are insensitive to the demands of reading comprehension. For example, they do not take into account the passage context, what the question actually asks, or whether the generated answer is semantically correct.
2. **Challenges in generative reading comprehension**: Posing reading comprehension as generation allows open-ended questions with few restrictions on possible answers. This flexibility also makes evaluation harder: overlap-based metrics cannot capture subtle differences between a generated answer and the reference answer, so they give an unreliable picture of model performance.

To address these problems, the authors introduce MOCHA (MOdeling Correctness with Human Annotations), a dataset for training and evaluating metrics for generative reading comprehension. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets, plus an additional set of minimal pairs for evaluation. Using MOCHA, the authors train LERC (Learned Evaluation metric for Reading Comprehension) to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations. On the minimal pairs it reaches 80% accuracy, 14 to 26 absolute percentage points above the baselines, which still leaves significant room for improvement.
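The weakness of overlap-based scoring can be illustrated with a minimal sketch. The example below is not taken from MOCHA or from the paper: the sentences are invented, `token_f1` is a generic whitespace-token F1 standing in for the family of n-gram metrics, and `minimal_pair_accuracy` assumes that accuracy on minimal pairs means the fraction of pairs in which the metric scores the better candidate higher.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Whitespace-token F1 between a candidate and a reference answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def minimal_pair_accuracy(pairs, metric) -> float:
    """Fraction of (reference, better, worse) triples ranked correctly.

    Assumption: a pair counts as correct when the metric gives the
    better candidate a strictly higher score than the worse one.
    """
    hits = sum(metric(good, ref) > metric(bad, ref) for ref, good, bad in pairs)
    return hits / len(pairs)


# Hypothetical minimal pair (invented for illustration).
reference = "he was afraid of the dog"
correct = "the dog scared him"          # right meaning, little overlap
incorrect = "he was afraid of the cat"  # wrong meaning, high overlap

f1_correct = token_f1(correct, reference)
f1_incorrect = token_f1(incorrect, reference)
accuracy = minimal_pair_accuracy([(reference, correct, incorrect)], token_f1)

print(f"correct:   {f1_correct:.2f}")
print(f"incorrect: {f1_incorrect:.2f}")
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Here the wrong answer shares five of six tokens with the reference and so outscores the correct paraphrase, which is the kind of failure a learned metric trained on human judgements is meant to avoid.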