Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph Level

Daniel Deutsch, Juraj Juraska, Mara Finkelstein, Markus Freitag
2023-08-29
Abstract: As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. In this work, we first propose a method for creating paragraph-level data for training and meta-evaluating metrics from existing sentence-level data. Then, we use these new datasets to benchmark existing sentence-level metrics as well as train learned metrics at the paragraph level. Interestingly, our experimental results demonstrate that using sentence-level metrics to score entire paragraphs is equally as effective as using a metric designed to work at the paragraph level. We speculate this result can be attributed to properties of the task of reference-based evaluation as well as limitations of our datasets with respect to capturing all types of phenomena that occur in paragraph-level translations.
Computation and Language
What problem does this paper attempt to address?
The paper asks whether existing automatic evaluation metrics remain effective when scoring machine translation output at the paragraph level. As machine translation research moves from single sentences to paragraphs, chapters, and whole documents, it is unclear how well metrics handle these longer texts. The authors propose a method for building paragraph-level datasets from existing sentence-level data, for both training and meta-evaluating metrics. They then use these datasets to benchmark existing sentence-level metrics and to train learned metrics at the paragraph level. The experiments show that applying sentence-level metrics to entire paragraphs is as effective as using a metric designed for the paragraph level. The authors speculate that this result stems from properties of reference-based evaluation and from limitations of the datasets, which may not capture all phenomena that occur in paragraph-level translation, such as long-distance dependencies. In short, the paper examines how to evaluate paragraph-level translation quality and whether sentence-level metrics can be applied directly to paragraphs without paragraph-specific adaptation or retraining.
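As a rough illustration of building paragraph-level data from sentence-level data, the sketch below joins consecutive sentences from the same document into paragraph-level examples. It is not the authors' implementation: the `Segment` and `Paragraph` types, the fixed `max_sentences` window, and the choice to average the sentence-level ratings are all assumptions made for this example, since the summary does not say how paragraphs are delimited or how ratings are combined.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Segment:
    """One sentence-level example (hypothetical schema)."""
    doc_id: str
    source: str
    translation: str
    reference: str
    score: float  # sentence-level human rating


@dataclass
class Paragraph:
    """One paragraph-level example built from consecutive segments."""
    doc_id: str
    source: str
    translation: str
    reference: str
    score: float


def build_paragraphs(segments: List[Segment], max_sentences: int = 3) -> List[Paragraph]:
    """Group consecutive same-document segments into paragraphs.

    Assumes `segments` is already in document order, so that all
    segments of a document are adjacent. The paragraph score is the
    mean of the sentence scores, which is an assumption of this sketch.
    """
    paragraphs = []
    for doc_id, group in groupby(segments, key=lambda s: s.doc_id):
        doc_segments = list(group)
        # Slide over the document in non-overlapping windows.
        for start in range(0, len(doc_segments), max_sentences):
            chunk = doc_segments[start:start + max_sentences]
            paragraphs.append(
                Paragraph(
                    doc_id=doc_id,
                    source=" ".join(s.source for s in chunk),
                    translation=" ".join(s.translation for s in chunk),
                    reference=" ".join(s.reference for s in chunk),
                    score=sum(s.score for s in chunk) / len(chunk),
                )
            )
    return paragraphs
```

Under these assumptions, a sentence-level metric can be applied to a paragraph simply by passing it the joined `translation` and `reference` strings, and its outputs can then be compared against the aggregated `score` values during meta-evaluation.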