Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT.

Qingyu Lu, Baopu Qiu, Liang Ding, Kanjian Zhang, Tom Kocmi, Dacheng Tao
DOI: https://doi.org/10.48550/arxiv.2303.13809
2023-01-01
Abstract: Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that utilizing LLMs for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we investigate several prompting designs and propose a new prompting method called Error Analysis Prompting (EAPrompt), which combines Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2023). This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM; Freitag et al., 2021), and produces explainable and reliable MT evaluations at both the system and segment level. Experimental results from the WMT22 metrics shared task validate the effectiveness of EAPrompt on various LLMs with different structures. Further analysis confirms that EAPrompt effectively distinguishes major errors from minor ones, while also sharing a similar distribution of the number of errors with MQM. These findings highlight the potential of EAPrompt as a human-like evaluator prompting technique for MT evaluation.