Abstract: Question-answering (QA) systems are becoming increasingly important because they enable human-computer communication in natural language. In recent years, significant progress has been made with transformer-based models that combine deep learning with large amounts of text data. However, a significant challenge with QA systems lies in their complexity, which is rooted in the ambiguity and flexibility of natural language. This makes even their evaluation a formidable task. For this reason, this study focuses on the evaluation of extractive question-answering (EQA) systems by conducting a large-scale analysis of DistilBERT using benchmark data provided by the Stanford Question Answering Dataset (SQuAD). Specifically, the main objectives of this paper are fourfold. First, we study the influence of answer length on performance and demonstrate that the two are inversely correlated. Second, we study differences in exact match (EM) measures, because different definitions are commonly used in the literature. We find that although all of these measures are named "exact match", they actually differ from each other. Third, we study the practical relevance of these different definitions: because of the ambiguous meaning of "exact match" in the literature, it is often unclear whether reported improvements are genuine or only due to a change in the exact match measure. Importantly, our results show that differences between differently defined EM measures are of the same order of magnitude as reported differences found in the literature. This raises concerns about the robustness of reported results. Fourth, we provide guidelines to improve the experimental design of general EQA studies, aiming to enhance performance evaluation and minimize the potential for spurious results.
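
To make the second and third objectives concrete, the sketch below contrasts two plausible definitions of "exact match". It is an illustration only: the abstract does not state which definitions the study compares, so both functions are assumptions. `strict_em` uses raw string equality, while `normalized_em` applies the kind of normalization commonly associated with SQuAD-style evaluation (lowercasing, removing punctuation and English articles, collapsing whitespace). The prediction/gold pairs are invented toy data, not results from the paper.

```python
import re
import string


def strict_em(prediction: str, gold: str) -> int:
    """Assumed definition 1: exact match as raw string equality."""
    return int(prediction == gold)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_em(prediction: str, gold: str) -> int:
    """Assumed definition 2: exact match after text normalization."""
    return int(normalize(prediction) == normalize(gold))


def em_score(pairs, metric) -> float:
    """Percentage of (prediction, gold) pairs the given metric counts as a match."""
    return 100.0 * sum(metric(p, g) for p, g in pairs) / len(pairs)


# Invented toy data (prediction, gold); not taken from the study.
pairs = [
    ("Denver Broncos", "Denver Broncos"),
    ("the Denver Broncos", "Denver Broncos"),
    ("Denver Broncos.", "Denver Broncos"),
    ("Carolina Panthers", "Denver Broncos"),
]

print(em_score(pairs, strict_em))      # 25.0
print(em_score(pairs, normalized_em))  # 75.0
```

On this toy input the same predictions score 25.0 under one definition and 75.0 under the other, although both would be reported as "exact match". The gap here is exaggerated by construction; the point is only that a reported EM value is hard to interpret unless the definition behind it is stated.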

What's in a Name? Answer Equivalence For Open-Domain Question Answering

Open Domain Question Answering Via Semantic Enrichment

PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

Knowledge-Aided Open-Domain Question Answering

(QA)$^2$: Question Answering with Questionable Assumptions

Experimental Design of Extractive Question-Answering Systems: Influence of Error Scores and Answer Length

Hybrid Question Answering over Knowledge Base and Free Text

KBQA: Learning Question Answering over QA Corpora and Knowledge Bases

An Open Domain Question Answering System Based on Improved System Similarity Model

L2R-QA: An Open-Domain Question Answering Framework

Exploiting Abstract Meaning Representation for Open-Domain Question Answering

Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets

Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Evaluating Open-QA Evaluation

Long-Tailed Question Answering in an Open World

Answerability in Retrieval-Augmented Open-Domain Question Answering

XAIQA: Explainer-Based Data Augmentation for Extractive Question Answering

SEMQA: Semi-Extractive Multi-Source Question Answering

Answering Science Exam Questions Using Query Rewriting with Background Knowledge

'Just because you are right, doesn't mean I am wrong': Overcoming a Bottleneck in the Development and Evaluation of Open-Ended Visual Question Answering (VQA) Tasks