Machine Translation with Large Language Models: Prompt Engineering for Persian, English, and Russian Directions

Nooshin Pourkamali, Shler Ebrahim Sharifi
2024-01-16
Abstract: Generative large language models (LLMs) have demonstrated exceptional proficiency in various natural language processing (NLP) tasks, including machine translation, question answering, text summarization, and natural language understanding.
Computation and Language, Artificial Intelligence, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
This paper examines how to use large language models (LLMs) such as PaLM and GPT for machine translation (MT), focusing on two prompting methods and their combination across Persian, English, and Russian. The authors improve LLM performance through n-shot feeding and customized prompt frameworks, and analyze how the model, the task, the source language, and the target language affect the results.

They find that although LLMs can produce human-like translations, output quality depends on the prompting method, the language pair, and the selection of in-context learning examples. The paper also highlights the efficiency of LLMs in multilingual translation, especially for high-resource languages such as English and Russian and, in some cases, even for a low-resource language like Persian. However, LLMs are not stable MT tools: they may produce linguistic and literary errors as well as hallucinations, and they may perform poorly when combining multiple translations. The study further reports that zero-shot setups with enhanced prompts are generally more accurate and fluent than n-shot setups.

Translation results are evaluated with automatic metrics (BLEU, chrF++, and COMET) and with human evaluation, in which errors are analyzed and categorized. PaLM performs best on long texts and subtle stylistic differences, which may be attributed to the large amount of multilingual data in its training set. Other models, including GPT-3.5, GPT-4, Claude, meta-llama/Llama-2-70b, and Perplexity AI + Copilot, perform differently under different settings. Overall, the paper aims to offer preliminary guidance on the proper use of LLMs in machine translation and proposes prompt design methods intended to improve the accuracy and reliability of natural language processing tasks.
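The difference between zero-shot and n-shot prompting discussed in the summary can be illustrated with a minimal sketch. The function name `build_prompt`, the instruction wording, and the example sentence pair are assumptions made for illustration; they are not the prompt templates used in the paper.

```python
from typing import Sequence, Tuple


def build_prompt(
    source_text: str,
    src_lang: str,
    tgt_lang: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build a translation prompt (hypothetical template, not the paper's).

    With no examples the prompt is zero-shot; with n (source, target)
    pairs it becomes an n-shot prompt.
    """
    lines = [f"Translate the following text from {src_lang} to {tgt_lang}."]
    # Each in-context example is shown as a source line and a target line.
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # The text to translate goes last, with the target line left open.
    lines.append(f"{src_lang}: {source_text}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


zero_shot = build_prompt("Good morning.", "English", "Russian")
one_shot = build_prompt(
    "Good morning.",
    "English",
    "Russian",
    examples=[("Thank you.", "Спасибо.")],
)
print(one_shot)
```

The resulting string would be sent to whichever model is under test. Since the summary notes that results depend on which in-context examples are selected, the `examples` argument is the natural place to vary in such an experiment.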