Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels

Jianhao Yan, Pingchuan Yan, Yulong Chen, Jing Li, Xianchao Zhu, Yue Zhang
2024-11-21
Abstract: This study presents a comprehensive evaluation of GPT-4's translation capabilities compared to human translators of varying expertise levels. Through systematic human evaluation using the MQM schema, we assess translations across three language pairs (Chinese$\longleftrightarrow$English, Russian$\longleftrightarrow$English, and Chinese$\longleftrightarrow$Hindi) and three domains (News, Technology, and Biomedical). Our findings reveal that GPT-4 achieves performance comparable to junior-level translators in terms of total errors, while still lagging behind senior translators. Unlike traditional Neural Machine Translation systems, which show significant performance degradation in resource-poor language directions, GPT-4 maintains consistent translation quality across all evaluated language pairs. Through qualitative analysis, we identify distinctive patterns in translation approaches: GPT-4 tends toward overly literal translations and exhibits lexical inconsistency, while human translators sometimes over-interpret context and introduce hallucinations. This study represents the first systematic comparison between LLM and human translators across different proficiency levels, providing valuable insights into the current capabilities and limitations of LLM-based translation systems.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) perform at translation, focusing on how GPT-4 compares with human translators of different expertise levels. Using systematic human evaluation under the Multidimensional Quality Metrics (MQM) framework, the researchers assess GPT-4's translations against those of junior, intermediate, and senior human translators. The study covers three language pairs (Chinese-English, Russian-English, Chinese-Hindi) and three domains (news, technology, biomedicine). It aims to show how GPT-4 performs across language directions and domains with different levels of resource availability, and where it is stronger or weaker than human translators. The main objectives of the paper are:
1. **Evaluating the translation performance of GPT-4**: Determining GPT-4's translation quality across language pairs and domains, especially in resource-poor language directions.
2. **Comparing GPT-4 with human translators**: Using systematic human evaluation to compare GPT-4's translation quality with that of human translators at different levels, and to find the systematic differences between them.
3. **Analyzing error types in translation**: Identifying the error types common to GPT-4 and to human translators, such as mistranslations, grammar errors, and named-entity errors, in order to understand their respective weaknesses.
4. **Providing benchmarks and insights**: Providing a benchmark for future research and clarifying the current capabilities and limitations of LLMs in translation tasks.
Through these objectives, the paper aims to inform machine translation research, particularly on how LLMs compare with human translators.
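The comparison "in terms of total errors" and the error-type analysis described above can be pictured as a simple tally over annotated errors. The following is a minimal sketch only, not the authors' evaluation code: the annotation records, system labels, and category names are invented for illustration, and the function names `tally_errors` and `total_errors` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy annotations, invented purely for illustration (NOT data from the paper).
# Each record is (system, error_category), as an annotator might mark it.
annotations = [
    ("gpt4", "mistranslation"),
    ("gpt4", "grammar"),
    ("junior", "mistranslation"),
    ("junior", "named_entity"),
    ("senior", "grammar"),
]


def tally_errors(records):
    """Count annotated errors per system, broken down by category."""
    per_system = defaultdict(Counter)
    for system, category in records:
        per_system[system][category] += 1
    return per_system


def total_errors(per_system):
    """Collapse the per-category counts into one total per system."""
    return {system: sum(counts.values()) for system, counts in per_system.items()}


tallies = tally_errors(annotations)
totals = total_errors(tallies)

# Print one line per system: total errors, then the per-category breakdown.
for system in sorted(totals):
    print(system, totals[system], dict(tallies[system]))
```

Keeping the per-category counts separate from the totals mirrors the two views in the summary: total errors for ranking systems against translator levels, and category breakdowns for understanding where each one tends to fail.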