Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation

Xiaoman Wang, Claudio Fantinuoli
2024-06-14
Abstract: Assessing the performance of interpreting services is a complex task, given the nuanced nature of spoken language translation, the strategies that interpreters apply, and the diverse expectations of users. The complexity of this task becomes even more pronounced when automated evaluation methods are applied. This is particularly true because interpreted texts exhibit less linearity between the source and target languages due to the strategies employed by the interpreter. This study aims to assess the reliability of automatic metrics in evaluating simultaneous interpretations by analyzing their correlation with human evaluations. We focus on a particular feature of interpretation quality, namely translation accuracy or faithfulness. As a benchmark we use human assessments performed by language experts, and evaluate how well sentence embeddings and Large Language Models correlate with them. We quantify semantic similarity between the source and translated texts without relying on a reference translation. The results suggest GPT models, particularly GPT-3.5 with direct prompting, demonstrate the strongest correlation with human judgment in terms of semantic similarity between source and target texts, even when evaluating short textual segments. Additionally, the study reveals that the size of the context window has a notable impact on this correlation.
Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to address is how closely automatic evaluation metrics agree with human evaluation when assessing the quality of simultaneous interpretation. Specifically, the study explores two main questions:

1. **Is there an automatic evaluation metric that is highly consistent with human judgment and can be used to automatically assess the accuracy of spoken translation?**
2. **Which of these automatic evaluation metrics is more effective in evaluating human-generated translations versus machine-generated translations?**

### Background and Motivation

Evaluating the quality of interpretation is a complex and challenging task. Traditionally, this evaluation is done manually. Although manual evaluation can provide a comprehensive quality assessment, it is labor-intensive, time-consuming, and costly. In recent years, with the development of natural language processing technology, researchers have begun to explore automatic evaluation metrics for assessing interpretation quality. However, because interpreted speech is less linear and more varied than written translation, traditional automatic evaluation metrics (such as BLEU) may not be effective for spoken translation.

### Research Methods

To answer the questions above, the researchers adopted the following methods:

1. **Dataset**: The study used simultaneous interpretation of 12 English speeches into Spanish, each speech lasting about 5 minutes. These speeches covered various scenarios such as lectures, business presentations, live tutorials, and political speeches.
2. **Evaluation Methods**:
   - **Human Evaluation**: 18 evaluators (9 professional interpreters and 9 bilingual individuals) used a six-point Likert scale to assess the accuracy of the translations. The evaluators did not know whether a translation was generated by humans or machines.
   - **Automatic Evaluation**: Three embedding models (all-MiniLM-L6-v2, GPT-Ada, USEM) and GPT-3.5 with direct prompting were used to calculate the semantic similarity between the source text and the translation.
3. **Correlation Analysis**: The Pearson correlation coefficient was calculated between the human evaluation scores and the automatic evaluation scores.

### Main Findings

1. **GPT-3.5 Performed Best**: GPT-3.5 with direct prompting performed best, with the highest median correlation and more consistent correlation values than the other methods.
2. **Impact of Window Size**: The study found that window size (i.e., the number of combined paragraphs) notably affected the automatic evaluation results. For human-generated translations, GPT-3.5 performed better with larger window sizes; for machine-generated translations, all-MiniLM-L6-v2 performed better with larger window sizes.
3. **Comparison of Different Evaluation Methods**: GPT-3.5 showed high correlation in evaluating both human-generated and machine-generated translations, while the performance of the other models varied.

### Conclusion

The study results indicate that GPT-3.5 has the highest correlation with human judgment in evaluating the quality of simultaneous interpretation, even when evaluating short text segments. This supports the development of more effective automatic evaluation tools. However, the study also points out that the performance of automatic evaluation metrics on larger text segments still needs further optimization. Additionally, the study emphasizes the need to consider ethical issues such as privacy protection and fairness in practical applications.
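
The evaluation pipeline summarized above (combine text units into windows, score source/target similarity, then correlate with human ratings) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the non-overlapping windowing scheme, and the toy numbers are all assumptions made for illustration, and in practice the vectors passed to `cosine_similarity` would come from an embedding model such as all-MiniLM-L6-v2.

```python
import math
from statistics import mean


def combine_windows(units, window_size):
    """Join consecutive text units into non-overlapping windows.

    Assumption: the summary does not say how windows were built;
    non-overlapping grouping is only one plausible scheme.
    """
    return [
        " ".join(units[i:i + window_size])
        for i in range(0, len(units), window_size)
    ]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy data, invented for illustration only (not results from the paper):
# one human Likert rating (1-6) and one automatic similarity score per window.
human_scores = [6, 5, 2, 4, 1]
automatic_scores = [0.90, 0.80, 0.35, 0.70, 0.20]

print(combine_windows(["s1", "s2", "s3", "s4", "s5"], window_size=2))
print(round(pearson(human_scores, automatic_scores), 3))
```

A coefficient near 1 would mean the automatic metric ranks windows much as the human raters do. Repeating the calculation for several values of `window_size` is one way to examine the window-size effect the summary describes.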