Guiding In-Context Learning of LLMs through Quality Estimation for Machine Translation

Javad Pourmostafa Roshan Sharami, Dimitar Shterionov, Pieter Spronck
2024-09-18
Abstract: The quality of output from large language models (LLMs), particularly in machine translation (MT), is closely tied to the quality of in-context examples (ICEs) provided along with the query, i.e., the text to translate. The effectiveness of these ICEs is influenced by various factors, such as the domain of the source text, the order in which the ICEs are presented, the number of these examples, and the prompt templates used. Naturally, selecting the most impactful ICEs depends on understanding how these affect the resulting translation quality, which ultimately relies on translation references or human judgment. This paper presents a novel methodology for in-context learning (ICL) that relies on a search algorithm guided by domain-specific quality estimation (QE). Leveraging the XGLM model, our methodology estimates the resulting translation quality without the need for translation references, selecting effective ICEs for MT to maximize translation quality. Our results demonstrate significant improvements over existing ICL methods and higher translation performance compared to fine-tuning a pre-trained language model (PLM), specifically mBART-50.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper examines how the output quality of large language models (LLMs) in machine translation (MT) depends on the in-context examples (ICEs) supplied with the query. It focuses on three questions:

1. **Selection of ICEs**: How to select the ICEs that improve translation quality the most. Their effectiveness depends on the domain of the source text, the order and number of ICEs, and the prompt template.
2. **Evaluation of ICEs**: How to measure the impact of ICEs on translation quality without reference translations. Conventional evaluation relies on references or human judgment, neither of which is available when translating new text.
3. **Optimization of Search Algorithms**: How to combine a search algorithm with domain-specific quality estimation (QE) to find the combination of ICEs that maximizes translation quality. The proposed method translates with the XGLM model and scores the output with QE, so ICEs can be selected without reference translations.

### Main Contributions

1. **Proposing a New ICL Method**: An unsupervised retriever supplies candidate ICEs, and a search algorithm guided by domain-specific QE selects the combination that yields the best estimated translation quality.
2. **Experimental Validation**: German-to-English translation experiments show that the method significantly outperforms existing ICL methods and a fine-tuned mBART-50 model on multiple metrics.
3. **Computational Efficiency**: Early stopping conditions cut the search short, reducing the computational cost of the method.
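
The QE-guided search with early stopping described above can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the function `select_ices`, its `translate` and `estimate_quality` callables, the `patience` parameter, and the rule of keeping a candidate only when the QE score improves are all hypothetical. In the paper's setting, the translation step would be an XGLM call and the candidates would come pre-ranked from the retriever; here both are replaced by toy stand-ins so the control flow can be run end to end.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (source sentence, reference translation)


def select_ices(
    source: str,
    ranked_candidates: Sequence[Example],
    translate: Callable[[str, List[Example]], str],
    estimate_quality: Callable[[str, str], float],
    patience: int = 3,
) -> Tuple[List[Example], str, float]:
    """Greedily grow an ICE set, keeping a candidate only if the QE score improves.

    Hypothetical sketch: stops after `patience` consecutive candidates fail to help.
    """
    selected: List[Example] = []
    best_translation = translate(source, selected)  # baseline with no ICEs
    best_score = estimate_quality(source, best_translation)
    misses = 0  # consecutive candidates that did not raise the QE score

    for candidate in ranked_candidates:
        trial = selected + [candidate]
        hypothesis = translate(source, trial)
        score = estimate_quality(source, hypothesis)  # no reference needed
        if score > best_score:
            selected, best_translation, best_score = trial, hypothesis, score
            misses = 0
        else:
            misses += 1
            if misses >= patience:
                break  # early stopping: further candidates are not tried
    return selected, best_translation, best_score


# Toy stand-ins (assumptions for demonstration only, not real MT or QE models).
def toy_translate(source: str, ices: List[Example]) -> str:
    # Pretend the "translation" is just the concatenated ICE targets.
    return " ".join(target for _, target in ices)


def toy_qe(source: str, hypothesis: str) -> float:
    # Pretend quality rises with "good" tokens and falls with "bad" ones.
    return float(hypothesis.count("good") - hypothesis.count("bad"))


TOY_CANDIDATES: List[Example] = [
    ("a", "good"), ("b", "bad"), ("c", "bad"), ("d", "good"),
]

if __name__ == "__main__":
    print(select_ices("src", TOY_CANDIDATES, toy_translate, toy_qe, patience=2))
```

With a small `patience`, the loop gives up before it reaches the last helpful candidate, which illustrates the trade-off early stopping makes between search cost and the quality of the selected set.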
### Experimental Setup and Results

- **Dataset**: German-to-English sentence pairs from the IT domain, with approximately 222k training sentences, 2k development sentences, and 2k test sentences.
- **Evaluation Metrics**: Translation performance is measured with BLEU and COMET.
- **Experimental Modes**:
  - **Mode 1**: ICEs are ranked with BM25, and QE is used to evaluate their effectiveness.
  - **Mode 2**: ICEs are ranked by n-gram overlap, again combined with QE evaluation.
  - **Mode 3**: BLEU computed on the test set replaces QE. This setting is unrealistic in practice and serves only as an upper bound against which the other modes are compared.

### Results Analysis

- **Performance Improvement**: In all experimental modes, the proposed method significantly outperformed existing ICL methods and the fine-tuned mBART-50 model on BLEU and COMET.
- **Time Efficiency**: Mode 1 and Mode 2 are time-consuming, but the early stopping conditions effectively reduced their computation time.
- **ICE Quantity Analysis**: Mode 3, which has access to the test set, found the optimal combination of ICEs and therefore selected fewer of them. Mode 1 and Mode 2 selected more ICEs because the QE scores are imperfect.

### Conclusion

The paper proposes a method that combines unsupervised retrievers with domain-specific quality estimation to address ICE selection and evaluation for LLM-based machine translation. The experiments show that the method improves translation quality, while early stopping keeps the computational cost of the search in check.