Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?

Dawei Zhu, Pinzhen Chen, Miaoran Zhang, Barry Haddow, Xiaoyu Shen, Dietrich Klakow
2024-10-04
Abstract: Traditionally, success in multilingual machine translation can be attributed to three key factors in training data: large volume, diverse translation directions, and high quality. In the current practice of fine-tuning large language models (LLMs) for translation, we revisit the importance of these factors. We find that LLMs display strong translation capability after being fine-tuned on as few as 32 parallel sentences and that fine-tuning on a single translation direction enables translation in multiple directions. However, the choice of direction is critical: fine-tuning LLMs with only English on the target side can lead to task misinterpretation, which hinders translation into non-English languages. Problems also arise when noisy synthetic data is placed on the target side, especially when the target language is well-represented in LLM pre-training. Yet interestingly, synthesized data in an under-represented language has a less pronounced effect. Our findings suggest that when adapting LLMs to translation, the requirement on data quantity can be eased but careful considerations are still crucial to prevent an LLM from exploiting unintended data biases.
Computation and Language
What problem does this paper attempt to address?
This paper explores whether large language models (LLMs) can be effectively adapted to machine translation (MT) by fine-tuning on a small amount of possibly noisy data. Specifically, it attempts to answer the following key questions:

1. **Data volume requirements**: Traditionally, the success of multilingual machine translation has depended on large, diverse, high-quality parallel corpora. The paper studies whether the required data volume can be reduced significantly when fine-tuning LLMs, and whether effective translation performance can be achieved with as few as 32 parallel sentences.
2. **Impact of a single translation direction**: The paper examines whether fine-tuning on only one translation direction enables the model to translate in multiple directions. It finds that the choice of direction is crucial: fine-tuning with only English on the target side can lead to task misinterpretation, which hinders translation into non-English languages.
3. **Impact of synthetic data**: The paper also explores the effect of fine-tuning on low-quality synthetic data (such as data generated by back-translation or word-by-word translation). The results show that the quality of synthetic data has a significant impact on model performance. In particular, noise on the target-language side degrades performance. For resource-poor languages, however, the impact of synthetic data is smaller.

### Main research contents

- **Data efficiency**: The paper studies the impact of different amounts of fine-tuning data (from 32 to 4,096 samples) on translation performance and finds that, in some cases, a small amount of high-quality parallel data can significantly improve translation results.
- **Selection of translation direction**: The paper explores the effect of fine-tuning on only one translation direction and analyzes how well the result generalizes across language pairs. The results show that avoiding English as the target language prevents task misinterpretation and thus improves translation performance.
- **Role of synthetic data**: The paper evaluates the impact of low-quality synthetic data (such as back-translation and word-by-word translation) on model performance. It finds that synthetic data quality matters more on the target-language side, but that the model is more robust to it for resource-poor languages.

### Conclusion

The main conclusion of the paper is that, when fine-tuning LLMs for translation, a small amount of high-quality parallel data can significantly improve translation performance, but the translation direction must be chosen carefully to avoid task misinterpretation. The quality of synthetic data also has an important impact on model performance: when the target language is well represented in pre-training, low-quality synthetic data on the target side may degrade performance.
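
The setup summarized above (training sets of 32 to 4,096 samples, a single translation direction, and word-by-word synthetic targets) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the prompt template, the doubling schedule between 32 and 4,096, the English-to-German direction, the placeholder sentences, and all function names are assumptions made for the example.

```python
import random

# Hypothetical prompt template; the summary does not state the one used in the paper.
PROMPT = "Translate the following text from {src} to {tgt}.\n{src}: {source}\n{tgt}:"


def subset_sizes(smallest=32, largest=4096):
    """Training-set sizes from 32 to 4,096; doubling each step is an assumption."""
    sizes, n = [], smallest
    while n <= largest:
        sizes.append(n)
        n *= 2
    return sizes


def build_examples(pairs, src_lang, tgt_lang, n, seed=0):
    """Sample n sentence pairs and format them for one translation direction."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = rng.sample(pairs, n)
    return [
        {
            "prompt": PROMPT.format(src=src_lang, tgt=tgt_lang, source=src),
            "completion": " " + tgt,
        }
        for src, tgt in sample
    ]


def word_by_word(sentence, dictionary):
    """Toy word-level synthesis: look up each word, keep unknown words unchanged."""
    return " ".join(dictionary.get(word.lower(), word) for word in sentence.split())


# Placeholder parallel data standing in for a real corpus.
pairs = [(f"source sentence {i}", f"target sentence {i}") for i in range(4096)]

# One fine-tuning set per size, all in a single direction with a non-English target.
train_sets = {n: build_examples(pairs, "English", "German", n) for n in subset_sizes()}

# A noisy synthetic target built from a tiny illustrative dictionary.
noisy_target = word_by_word("The cat", {"the": "die", "cat": "Katze"})
```

Each entry in `train_sets` could then be passed to whatever supervised fine-tuning pipeline is in use. Keeping the target side non-English here follows the summary's observation that English-only targets can lead to task misinterpretation.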