Abstract: Traditionally, Machine Translation (MT) evaluation has been treated as a regression problem: producing an absolute translation-quality score. This approach has two limitations: i) the scores lack interpretability, and human annotators struggle to give consistent scores; ii) most scoring methods are based on (reference, translation) pairs, limiting their applicability in real-world scenarios where references are absent. In practice, we often care about whether a new MT system is better or worse than some competitors. In addition, reference-free MT evaluation is increasingly practical and necessary. Unfortunately, these two practical considerations have yet to be jointly explored. In this work, we formulate reference-free MT evaluation as a pairwise ranking problem. Given the source sentence and a pair of translations, our system, MT-Ranker, predicts which translation is better. In addition to proposing this new formulation, we further show that this new paradigm can achieve superior correlation with human judgments by merely using indirect supervision from natural language inference and weak supervision from our synthetic data. In the context of reference-free evaluation, MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21. On a more challenging benchmark, ACES, which contains fine-grained evaluation criteria such as addition, omission, and mistranslation errors, MT-Ranker achieves state-of-the-art results against reference-free as well as reference-based baselines.
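
The pairwise formulation described in the abstract can be illustrated with a minimal sketch: a comparer receives the source sentence and two candidate translations, and its decision is checked against a human preference label. Everything named here is an assumption for illustration only (`PairwiseExample`, `Comparer`, `pairwise_accuracy`, and especially `toy_length_comparer`, a trivial stand-in heuristic). It is not the MT-Ranker model, its training procedure, or the benchmark scoring used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical interface: a comparer takes (source, translation_a, translation_b)
# and returns an estimate of the probability that translation_a is better.
Comparer = Callable[[str, str, str], float]


@dataclass(frozen=True)
class PairwiseExample:
    source: str
    translation_a: str
    translation_b: str
    a_is_better: bool  # human preference label for this pair


def predict_a_is_better(comparer: Comparer, ex: PairwiseExample) -> bool:
    """Turn the comparer's probability into a binary ranking decision."""
    return comparer(ex.source, ex.translation_a, ex.translation_b) > 0.5


def pairwise_accuracy(comparer: Comparer, examples: Iterable[PairwiseExample]) -> float:
    """Fraction of pairs where the predicted preference matches the human label."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(predict_a_is_better(comparer, ex) == ex.a_is_better for ex in examples)
    return correct / len(examples)


def toy_length_comparer(source: str, a: str, b: str) -> float:
    """Placeholder only: prefers the translation whose word count is closer
    to the source's. A real system would use a trained model instead."""
    n = len(source.split())
    gap_a = abs(len(a.split()) - n)
    gap_b = abs(len(b.split()) - n)
    if gap_a == gap_b:
        return 0.5
    return 1.0 if gap_a < gap_b else 0.0


# Two hand-made pairs (invented for this sketch, not benchmark data).
examples = [
    PairwiseExample("Der Hund liegt im Garten", "The dog lies in the garden", "The dog lies", True),
    PairwiseExample("Ich mag Tee", "I like tea very much indeed today", "I like tea", False),
]
accuracy = pairwise_accuracy(toy_length_comparer, examples)
print(accuracy)
```

Because the comparer only has to produce a relative preference, no reference translation appears anywhere in the interface. Replacing `toy_length_comparer` with a trained model leaves the evaluation loop unchanged.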

MT-Ranker: Reference-free machine translation evaluation by inter-system ranking
