Evaluating NLP Models Via Contrast Sets.
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, Ben Zhou
2020-01-01
Abstract: Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, by up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
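
The gap the abstract describes can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration: the sentences, the labels, and the `keyword_rule` function, which stands in for a model that has learned a simple decision rule. None of it comes from the released contrast sets or from the authors' code.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, gold label)


def keyword_rule(text: str) -> str:
    """Toy stand-in for a model: a shallow rule keyed on one word (hypothetical)."""
    return "positive" if "great" in text.lower() else "negative"


def accuracy(model: Callable[[str], str], data: List[Example]) -> float:
    """Fraction of examples where the model's prediction matches the gold label."""
    correct = sum(model(text) == gold for text, gold in data)
    return correct / len(data)


# Hypothetical original test instances; the keyword rule happens to fit them.
original: List[Example] = [
    ("A great film with a moving story.", "positive"),
    ("Dull plot and flat acting.", "negative"),
]

# Hypothetical contrast set: small manual edits of the instances above
# that change the gold label.
contrast: List[Example] = [
    ("Hardly a great film, despite a moving story.", "negative"),
    ("Sharp plot and lively acting.", "positive"),
]

original_acc = accuracy(keyword_rule, original)
contrast_acc = accuracy(keyword_rule, contrast)
print(f"original: {original_acc:.2f}  contrast: {contrast_acc:.2f}")
```

In this toy setting the rule scores perfectly on the original instances and fails on both perturbed ones, which is the kind of discrepancy a contrast set is meant to expose. Real models and datasets would show far less extreme differences.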