Abstract: Large language models (LLMs) have garnered significant attention for their in-context learning mechanisms and emergent capabilities. The research community has conducted several pilot studies that apply LLMs to machine translation and evaluate their performance from diverse perspectives. However, previous research has focused primarily on the LLM itself and has not explored human intervention in the LLM inference process. Characteristics of LLMs such as in-context learning and prompt engineering closely mirror human cognitive abilities in language tasks, offering an intuitive route to human-in-the-loop generation. In this study, we propose a human-in-the-loop pipeline that guides LLMs to produce customized outputs through revision instructions. The pipeline first prompts the LLM to produce a draft translation, then uses automatic retrieval or human feedback as supervision signals to improve the translation through in-context learning. The human-machine interactions generated in this pipeline are also stored in an external database to expand the in-context retrieval database, enabling us to leverage human supervision in an offline setting. We evaluate the proposed pipeline using the GPT-3.5-turbo API on five domain-specific benchmarks for German-English translation. The results demonstrate that the pipeline is effective in tailoring in-domain translations and improves translation performance over direct translation. Additionally, we discuss the results from the following perspectives: 1) the effectiveness of different in-context retrieval methods; 2) the construction of a retrieval database under low-resource scenarios; 3) the observed domain differences; 4) a quantitative analysis of linguistic statistics; and 5) a qualitative analysis of translation cases.
The code and data are available at https://github.com/NLP2CT/HIL-MT/.
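The draft, retrieve, revise, and store loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Record`, `FeedbackStore`, `build_revision_prompt`, and `translate` are hypothetical, the prompt wording is invented for illustration, `llm` is any caller-supplied function that maps a prompt string to a completion string, and the surface-similarity ranking from Python's `difflib` is only a stand-in for the in-context retrieval methods the paper compares.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class Record:
    """One stored human-machine interaction (hypothetical schema)."""
    source: str   # German source sentence
    draft: str    # draft translation produced by the LLM
    revised: str  # translation after the revision step


@dataclass
class FeedbackStore:
    """In-memory stand-in for the external retrieval database."""
    records: List[Record] = field(default_factory=list)

    def add(self, record: Record) -> None:
        self.records.append(record)

    def retrieve(self, source: str, k: int = 2) -> List[Record]:
        # Assumption: rank stored records by character-level similarity
        # of their source sentence to the new source sentence.
        ranked = sorted(
            self.records,
            key=lambda r: SequenceMatcher(None, source, r.source).ratio(),
            reverse=True,
        )
        return ranked[:k]


def build_revision_prompt(source: str, draft: str,
                          examples: List[Record], instruction: str = "") -> str:
    """Assemble an in-context prompt from retrieved revision examples."""
    lines: List[str] = []
    for ex in examples:
        lines += [f"German: {ex.source}", f"Draft: {ex.draft}",
                  f"Revised: {ex.revised}", ""]
    lines += [f"German: {source}", f"Draft: {draft}"]
    if instruction:
        lines.append(f"Instruction: {instruction}")
    lines.append("Revised:")
    return "\n".join(lines)


def translate(source: str, llm: Callable[[str], str], store: FeedbackStore,
              instruction: str = "", k: int = 2) -> str:
    # Step 1: ask the LLM for a draft translation.
    draft = llm(f"Translate German to English.\nGerman: {source}\nEnglish:")
    # Step 2: gather supervision signals (retrieved examples, human instruction).
    examples = store.retrieve(source, k)
    if not examples and not instruction:
        return draft  # nothing to revise with; keep the draft
    # Step 3: revise the draft through in-context learning.
    revised = llm(build_revision_prompt(source, draft, examples, instruction))
    # Step 4: store interactions that involved human feedback for offline reuse.
    if instruction:
        store.add(Record(source, draft, revised))
    return revised
```

In this sketch, a call with a human `instruction` both revises the draft and grows the store, so later calls without any instruction can still benefit from earlier feedback through retrieval, which mirrors the offline setting mentioned above.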
