Abstract: The quality of output from large language models (LLMs), particularly in machine translation (MT), is closely tied to the quality of in-context examples (ICEs) provided along with the query, i.e., the text to translate. The effectiveness of these ICEs is influenced by various factors, such as the domain of the source text, the order in which the ICEs are presented, the number of these examples, and the prompt templates used. Naturally, selecting the most impactful ICEs depends on understanding how these affect the resulting translation quality, which ultimately relies on translation references or human judgment. This paper presents a novel methodology for in-context learning (ICL) that relies on a search algorithm guided by domain-specific quality estimation (QE). Leveraging the XGLM model, our methodology estimates the resulting translation quality without the need for translation references, selecting effective ICEs for MT to maximize translation quality. Our results demonstrate significant improvements over existing ICL methods and higher translation performance compared to fine-tuning a pre-trained language model (PLM), specifically mBART-50.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper examines how in-context examples (ICEs) affect the output quality of large language models (LLMs) in machine translation (MT). It focuses on three aspects:

1. **Selection of ICEs**: How to select the most effective ICEs to improve translation quality. Their effectiveness depends on factors such as the domain of the source text, the order of the ICEs, the number of ICEs, and the prompt template.
2. **Evaluation of ICEs**: How to evaluate the impact of ICEs on translation quality without reference translations. Traditional evaluation relies on reference translations or human judgment, which limits its use in practice.
3. **Optimization of Search Algorithms**: How to combine a search algorithm with domain-specific quality estimation (QE) to select the combination of ICEs that maximizes translation quality. The proposed method leverages the XGLM model and estimates translation quality without reference translations, which lets it select effective ICEs.

### Main Contributions

1. **Proposing a New ICL Method**: An unsupervised retriever is combined with domain-specific QE, and a search algorithm selects the best combination of ICEs to improve translation quality.
2. **Experimental Validation**: The method is validated on German-to-English translation. It significantly outperforms existing ICL methods and the fine-tuned mBART-50 model on multiple metrics.
3. **Computational Efficiency**: Early stopping conditions reduce the computational cost of the search.
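The summary does not spell out the search procedure, so the following is only a minimal sketch of one way a QE-guided search with early stopping could look. It assumes a greedy strategy: candidates arrive already ranked by a retriever, a candidate is kept only if the reference-free QE score of the resulting translation improves, and the search stops after `patience` consecutive non-improving candidates. The names `select_ices`, `translate`, `qe_score`, and `patience`, as well as the toy translator and toy QE function, are hypothetical stand-ins and not the paper's implementation.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (source sentence, target translation)


def select_ices(
    query: str,
    ranked_candidates: Sequence[Example],
    translate: Callable[[str, List[Example]], str],
    qe_score: Callable[[str, str], float],
    patience: int = 2,
) -> Tuple[List[Example], str, float]:
    """Greedily grow the ICE set; keep a candidate only if the QE score improves."""
    selected: List[Example] = []
    best_translation = translate(query, selected)  # zero-shot baseline
    best_score = qe_score(query, best_translation)
    misses = 0  # consecutive candidates that did not help
    for candidate in ranked_candidates:
        trial = selected + [candidate]
        translation = translate(query, trial)
        score = qe_score(query, translation)  # reference-free estimate
        if score > best_score:
            selected, best_translation, best_score = trial, translation, score
            misses = 0
        else:
            misses += 1
            if misses >= patience:  # early stopping (assumed criterion)
                break
    return selected, best_translation, best_score


# --- Toy stand-ins so the sketch runs without an LLM or a QE model ---
def toy_translate(query: str, ices: List[Example]) -> str:
    """Word-by-word lookup in a lexicon built from position-aligned ICE pairs."""
    lexicon = {}
    for src, tgt in ices:
        for s, t in zip(src.split(), tgt.split()):
            lexicon[s] = t
    return " ".join(lexicon.get(word, word) for word in query.split())


ENGLISH = {"save", "the", "file"}


def toy_qe(source: str, translation: str) -> float:
    """Fraction of output words found in a tiny English vocabulary."""
    words = translation.split()
    return sum(w in ENGLISH for w in words) / len(words)


candidates = [
    ("speichere die liste", "save the list"),
    ("die datei", "the file"),
    ("hallo welt", "hello world"),
    ("guten tag", "good day"),
    ("neue datei", "new file"),
]
selected, translation, score = select_ices(
    "speichere die datei", candidates, toy_translate, toy_qe
)
print(selected, translation, score)
```

In this toy run the first two candidates each raise the QE score, the next two do not, and the search stops before the fifth candidate is tried. A real setup would replace `toy_translate` with a prompted LLM call and `toy_qe` with a trained QE model.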
### Experimental Setup and Results

- **Dataset**: German-to-English translation pairs in the IT domain, with approximately 222k training sentences, 2k development sentences, and 2k test sentences.
- **Evaluation Metrics**: BLEU and COMET were used to evaluate translation performance.
- **Experimental Modes**:
  - **Mode 1**: Combining BM25 ranking with QE to evaluate the effectiveness of ICEs.
  - **Mode 2**: Ranking ICEs by n-gram overlap and combining the ranking with QE.
  - **Mode 3**: Using BLEU scores computed on the test set as an upper-bound benchmark. This setting is unrealistic, since it requires access to the test set, but it serves as a point of comparison for the other modes.

### Results Analysis

- **Performance Improvement**: In all experimental modes, the proposed method significantly outperformed existing ICL methods and the fine-tuned mBART-50 model on BLEU and COMET.
- **Time Efficiency**: Mode 1 and Mode 2 are time-consuming, but the early stopping conditions effectively reduced their computation time.
- **ICE Quantity Analysis**: Mode 3, having access to the test set, selected the optimal combination of ICEs and therefore needed fewer of them. Mode 1 and Mode 2 selected more ICEs because their QE estimates are less accurate.

### Conclusion

The paper proposes a method that combines an unsupervised retriever with domain-specific quality estimation to address ICE selection and evaluation for LLM-based machine translation. The experimental results show that the method improves translation quality while keeping the search computationally efficient.
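To illustrate the BM25 ranking used as the retriever in Mode 1, here is a small self-contained sketch of Okapi BM25 over the source sides of a candidate pool. It is not the paper's code: the function name `bm25_rank`, the whitespace tokenization, the non-negative IDF variant, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def bm25_rank(
    query: str, sources: Sequence[str], k1: float = 1.5, b: float = 0.75
) -> List[Tuple[int, float]]:
    """Return (pool index, BM25 score) pairs, best match first."""
    docs = [s.lower().split() for s in sources]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many pool sentences each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(query.lower().split())  # each query term counted once
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((idx, score))
    # sorted() is stable, so tied sentences keep their pool order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


pool = [
    "speichere die liste",
    "hallo welt",
    "die datei wurde gespeichert",
    "guten tag",
]
ranking = bm25_rank("speichere die datei", pool)
for idx, score in ranking:
    print(f"{score:.3f}  {pool[idx]}")
```

The top-ranked sentences would then be handed, together with their translations, to the QE-guided selection step. Mode 2 would swap this scoring function for an n-gram overlap measure while leaving the rest of the pipeline unchanged.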

Guiding In-Context Learning of LLMs through Quality Estimation for Machine Translation
