Re-evaluating Evaluation in Text Summarization

Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, Graham Neubig
DOI: https://doi.org/10.48550/arXiv.2010.07100
2020-10-14
Abstract: Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not -- for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
Computation and Language, Information Retrieval, Machine Learning