Abstract: Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement, but this offers no insight into their behaviour across different error types. Challenge sets are used to probe specific dimensions of metric behaviour, but very few such datasets exist, and they focus on either a limited number of phenomena or a limited number of language pairs. We introduce ACES, a contrastive challenge set spanning 146 language pairs, aimed at discovering whether metrics can identify 68 translation accuracy errors. These phenomena range from simple alterations at the word/character level to more complex errors based on discourse and real-world knowledge. We conduct a large-scale study of 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks: we benchmark their performance on ACES, assess their incremental performance over successive campaigns, and measure their sensitivity to a range of linguistic phenomena. We also investigate claims that Large Language Models (LLMs) are effective as MT evaluators by evaluating them on ACES. Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods fail to demonstrate reliable performance. Our analyses indicate that most metrics ignore the source sentence, tend to prefer surface-level overlap, and end up incorporating properties of their base models that are not always beneficial. We expand ACES to include error span annotations, denoted SPAN-ACES, and use this dataset to evaluate span-based error metrics, showing that these metrics also need considerable improvement. Finally, we provide a set of recommendations for building better MT metrics: focusing on error labels instead of scores, ensembling, designing strategies that explicitly focus on the source sentence, focusing on semantic content, and choosing the right base model for representations.
