Abstract: Machine translation (MT) has become one of the most active research topics in the natural language processing (NLP) literature. An important issue in MT is how to evaluate MT systems reasonably and determine whether a translation system has improved. Traditional manual judgment methods are expensive, time-consuming, not repeatable, and sometimes suffer from low agreement. Popular automatic MT evaluation methods, on the other hand, have weaknesses of their own. Firstly, they tend to perform well on language pairs with English as the target language, but weakly when English is the source language. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes the metrics difficult to replicate and to apply to other language pairs. Thirdly, some popular metrics use an incomplete set of factors, which results in low performance on some practical tasks. In this thesis, to address these problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of each language. Thirdly, in the enhanced version of our methods, we design concise linguistic features based on part-of-speech (POS) information to show that our methods can yield even higher performance when using some external linguistic resources. Finally, we report the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages. In addition, we present some novel work on quality estimation of MT without reference translations, including the use of Naïve Bayes (NB) probability models, support vector machine (SVM) classification algorithms, and conditional random fields (CRFs).
