Abstract: Machine translation (MT) has become one of the most active research topics in natural language processing (NLP). An important issue in MT is how to evaluate an MT system reasonably and determine whether the system has improved. Traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes suffer from low agreement. Popular automatic MT evaluation methods, on the other hand, have several weaknesses. Firstly, they tend to perform well on language pairs with English as the target language, but poorly when English is the source. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes the metrics difficult to replicate and to apply to other language pairs. Thirdly, some popular metrics consider an incomplete set of factors, which results in low performance on some practical tasks. In this thesis, to address these problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of the languages. Thirdly, in the enhanced version of our methods, we design concise linguistic features using part-of-speech (POS) information to show that our methods can yield even higher performance when using some external linguistic resources. Finally, we report the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages. In addition, we present some novel work on quality estimation of MT without reference translations, including the use of Naïve Bayes (NB) probability models, support vector machine (SVM) classification algorithms, and conditional random fields (CRFs).

Grammar Accuracy Evaluation (GAE): Quantifiable Quantitative Evaluation of Machine Translation Models

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

Evaluation of really good grammatical error correction

A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction

Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean

Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT.

Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods

QE-EBM: Using Quality Estimators as Energy Loss for Machine Translation

On Accurate Evaluation of GANs for Language Generation

GRUEN for Evaluating Linguistic Quality of Generated Text

Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation

Enhancing Machine Translation Quality Estimation via Fine-Grained Error Analysis and Large Language Model

Automatic Arabic Grammatical Error Correction based on Expectation-Maximization routing and target-bidirectional agreement

The Unbearable Weight of Generating Artificial Errors for Grammatical Error Correction

LEPOR: An Augmented Machine Translation Evaluation Metric

Is ChatGPT a Good NLG Evaluator? A Preliminary Study

TransGEC: Improving Grammatical Error Correction with Translationese

Comparison of Grammatical Error Correction Using Back-Translation Models

Leveraging Denoised Abstract Meaning Representation for Grammatical Error Correction

Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction