Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis

Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, Lei Li
2024-06-14
Abstract: Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). In this paper, we systematically investigate the advantages and challenges of LLMs for MMT by answering two questions: 1) How well do LLMs perform in translating a massive number of languages? 2) Which factors affect LLMs' performance in translation? We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4. Our empirical results show that the translation capabilities of LLMs are continually evolving. GPT-4 has beaten the strong supervised baseline NLLB in 40.91% of translation directions but still faces a large gap towards commercial translation systems like Google Translate, especially on low-resource languages. Through further analysis, we discover that LLMs exhibit new working patterns when used for MMT. First, LLMs can acquire translation ability in a resource-efficient way and generate moderate translations even on zero-resource languages. Second, instruction semantics can surprisingly be ignored when given in-context exemplars. Third, cross-lingual exemplars can provide better task guidance for low-resource translation than exemplars in the same language pairs. Code will be released at: https://github.com/NJUNLP/MMT-LLM
Computation and Language
What problem does this paper attempt to address?
The paper addresses two main problems:

1. **Performance of large-scale multilingual machine translation**: The paper evaluates how large language models (LLMs) perform in multilingual machine translation (MMT) tasks involving a large number of languages. Specifically, the researchers want to understand whether these models can translate effectively between many languages, especially low-resource ones.
2. **Factors affecting the translation performance of LLMs**: Beyond evaluating performance, the researchers analyze experimentally which factors affect the performance of LLMs in multilingual machine translation. These include, but are not limited to, the size of the pre-training corpus, the design of in-context templates, and the selection of in-context exemplars.

To answer these questions, the researchers carried out the following work:

- **Evaluating multiple popular large language models**: The researchers selected eight popular LLMs, including ChatGPT and GPT-4, and systematically evaluated their performance on 102 languages and 606 translation directions.
- **Comparing with supervised baselines**: The researchers compared the LLMs with three strong supervised baselines (M2M-100, NLLB, and Google Translate), revealing the gaps between the different translation paradigms.
- **In-depth analysis of factors affecting translation performance**: Through experiments, the researchers discovered some new working patterns. For example, LLMs can acquire translation ability with limited resources and can even generate translations of moderate quality on zero-resource languages. In addition, cross-lingual exemplars can provide better task guidance for low-resource translation than exemplars in the same language pair.
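To make the idea of in-context templates and exemplars concrete, the sketch below assembles a few-shot translation prompt. It is a minimal illustration under stated assumptions, not the paper's implementation: the function name `build_prompt`, the default template string `"{src}={tgt}"`, and the example sentences are all hypothetical. The exemplars are deliberately drawn from a different language pair than the test sentence, which is what "cross-lingual exemplars" refers to.

```python
def build_prompt(exemplars, source_sentence, template="{src}={tgt}"):
    """Assemble an in-context translation prompt (hypothetical helper).

    exemplars: list of (source, target) pairs; they may come from a
        different language pair than the test sentence.
    template: assumed format string with {src} and {tgt} fields.
    """
    # One line per exemplar, filled in with both source and target.
    lines = [template.format(src=s, tgt=t) for s, t in exemplars]
    # The last line leaves the target empty for the model to complete.
    lines.append(template.format(src=source_sentence, tgt=""))
    return "\n".join(lines)


# German-English exemplars used to guide a French-English test sentence.
exemplars = [("Guten Morgen.", "Good morning."), ("Danke.", "Thank you.")]
prompt = build_prompt(exemplars, "Bonjour le monde.")
print(prompt)
```

Because the template is an ordinary format string, swapping it for another wording is a one-argument change, which makes it easy to compare template designs or exemplar selections.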
Through these studies, the paper not only shows the potential of LLMs in the field of multilingual machine translation but also points out the current challenges, especially that the performance on low-resource languages still needs to be improved.
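A figure such as "beats the baseline in 40.91% of translation directions" is a per-direction win rate. The sketch below shows one plausible way to compute such a number, assuming each system has a single quality score (for example BLEU) per translation direction. The function name `win_rate` and all scores are made up for illustration and are not taken from the paper.

```python
def win_rate(scores_a, scores_b):
    """Percentage of shared directions where system A scores strictly higher than B."""
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        raise ValueError("no shared translation directions")
    wins = sum(1 for d in shared if scores_a[d] > scores_b[d])
    return 100.0 * wins / len(shared)


# Illustrative scores only, keyed by (source, target) language codes.
llm = {("eng", "fra"): 40.0, ("eng", "deu"): 35.0,
       ("eng", "swh"): 10.0, ("fra", "eng"): 42.0}
baseline = {("eng", "fra"): 38.0, ("eng", "deu"): 36.0,
            ("eng", "swh"): 25.0, ("fra", "eng"): 41.0}

print(f"{win_rate(llm, baseline):.2f}% of directions won")
```

Ties count as losses here because the comparison is strict; a different tie policy would change the reported percentage.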