Abstract: Machine translation (MT) has become one of the most active research topics in natural language processing (NLP). An important issue in MT is how to evaluate an MT system reasonably and determine whether the system has actually improved. Traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes suffer from low agreement. Popular automatic MT evaluation methods, on the other hand, have several weaknesses. Firstly, they tend to perform well on language pairs with English as the target language but poorly when English is the source. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes the metrics difficult to replicate and to apply to other language pairs. Thirdly, some popular metrics use an incomplete set of factors, which results in low performance on some practical tasks. In this thesis, to address these problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of the languages. Thirdly, in the enhanced version of our methods, we design concise linguistic features using part-of-speech (POS) information to show that our methods can yield even higher performance when using some external linguistic resources. Finally, we report the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages. In addition, we present some novel work on quality estimation of MT without reference translations, including the use of Naïve Bayes (NB) probability models, support vector machine (SVM) classification algorithms, and conditional random fields (CRFs).
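The idea of a tunable evaluation model, in which factor weights are adjusted per language, can be illustrated with a minimal sketch. The abstract does not say how the factors are combined, so the weighted harmonic mean used here is an assumption made for illustration only. The factor names (`precision`, `recall`, `length_penalty`), their scores, and the weight values are likewise hypothetical and are not taken from the thesis.

```python
from typing import Dict


def weighted_harmonic_mean(factors: Dict[str, float],
                           weights: Dict[str, float]) -> float:
    """Combine factor scores in [0, 1] with a weighted harmonic mean.

    Assumption: this combination rule is illustrative; the abstract does
    not specify how the thesis combines its factors.
    """
    if any(score <= 0.0 for score in factors.values()):
        return 0.0  # a zero factor drives the harmonic mean to zero
    total_weight = sum(weights[name] for name in factors)
    return total_weight / sum(weights[name] / score
                              for name, score in factors.items())


# Hypothetical factor scores for one candidate translation.
factors = {"precision": 0.8, "recall": 0.5, "length_penalty": 1.0}

# Two hypothetical weight settings, e.g. tuned separately per language pair.
uniform = {"precision": 1.0, "recall": 1.0, "length_penalty": 1.0}
recall_heavy = {"precision": 1.0, "recall": 9.0, "length_penalty": 1.0}

score_uniform = weighted_harmonic_mean(factors, uniform)
score_recall_heavy = weighted_harmonic_mean(factors, recall_heavy)
```

With the recall-heavy weights, the low recall score pulls the combined score further down than it does under uniform weights. This is the kind of behavior one would tune against human judgments for a given language pair.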

RED: A Reference Dependency Based MT Evaluation Metric

An Automatic Machine Translation Evaluation Metric Based on Dependency Parsing Model

Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics

MT-Ranker: Reference-free machine translation evaluation by inter-system ranking

SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window

REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation

REFeREE: A REference-FREE Model-Based Metric for Text Simplification

Is Reference Necessary in the Evaluation of NLG Systems? When and Where?

BLEU might be Guilty but References are not Innocent

A Measure of the System Dependence of Automated Metrics

Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References

Reference-based Metrics Disprove Themselves in Question Generation

Evaluation of Machine Translation Based on Semantic Dependencies and Keywords

Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation

On the Limitations of Reference-Free Evaluations of Generated Text

Quality and Quantity of Machine Translation References for Automatic Metrics

Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics

Dependency forest for statistical machine translation

Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing

LEPOR: An Augmented Machine Translation Evaluation Metric

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics