Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

Pinzhen Chen, Simon Yu, Zhicheng Guo, Barry Haddow
2024-09-27
Abstract: Multilingual large language models are designed, claimed, and expected to cater to speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may not perfectly align with this objective owing to a heavy reliance on translation, which cannot cover language-specific knowledge but can introduce translation defects. It remains unknown whether the nature of the instruction data has an impact on the model output; conversely, it is questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues using controlled native or translated data during the instruction tuning and evaluation stages. We show that native or generation benchmarks reveal a notable difference between native and translated instruction data especially when model performance is high, whereas other types of test sets cannot. The comparison between round-trip and single-pass translations reflects the importance of knowledge from language-native resources. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines the effectiveness and limitations of using translated data in the instruction tuning and evaluation of multilingual large language models (LLMs). The researchers hypothesise that current fine-tuning and evaluation practices may not fully align with the design goals of multilingual models, because they rely heavily on translation, which cannot cover language-specific knowledge and can introduce translation defects. The paper explores the following questions:

1. **Does the nature of the instruction data affect the model output?** That is, does model performance differ between instruction tuning on native data and on translated data?
2. **Can a translated test set capture this difference?** If native and translated data do lead to different behaviour, does a translated test set reflect it?
3. **Do translation defects or a lack of language-specific knowledge cause the performance gap?** Round-trip translation is used to separate the two factors and see which has the greater impact on model performance.
4. **When translated data must be used, what techniques can narrow the performance gap?** For example, does regularization help close the gap between native and translated data on structured tasks?

To answer these questions, the paper runs a series of experiments with eight models of different scales and data distributions, evaluated on nine benchmarks of different natures, covering comparisons between translated and native data and between classification and generation tasks. The results show a notable performance gap between native and translated data on some benchmarks, especially when model performance is strong.
In addition, the study found that round-trip translated data performs better than single-pass translated data, which suggests that a lack of language-specific knowledge may be more harmful than translation defects. Finally, the paper shows that when translated data must be used, regularization can narrow the performance gap on structured tasks but has limited effect on generation tasks.
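
The round-trip versus single-pass comparison described above can be illustrated with a minimal sketch. It is not the paper's pipeline: the `build_variants` helper, the `translate` callable and its `(text, source_lang, target_lang)` signature, the English pivot, and the toy translator are all assumptions made for illustration. The idea it shows is that round-trip data starts from native text, so it keeps language-specific knowledge while picking up translation defects, whereas single-pass data starts from English text and has the defects without the native knowledge.

```python
from typing import Callable, Dict, List

# Hypothetical signature: (text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def build_variants(
    native: List[str],
    english: List[str],
    lang: str,
    translate: Translator,
    pivot: str = "en",  # assumption: English is the pivot language
) -> Dict[str, List[str]]:
    """Return three instruction-data conditions for one target language."""
    return {
        # Written in the target language: native knowledge, no translation defects.
        "native": list(native),
        # English data translated once: translation defects, no native knowledge.
        "single_pass": [translate(t, pivot, lang) for t in english],
        # Native data sent to the pivot and back: native knowledge plus defects.
        "round_trip": [
            translate(translate(t, lang, pivot), pivot, lang) for t in native
        ],
    }


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in translator that only tags the text; swap in a real MT system."""
    return f"{text} [{src}->{tgt}]"


variants = build_variants(
    native=["¿Qué es una paella?"],
    english=["What is a bagel?"],
    lang="es",
    translate=toy_translate,
)
```

Under these assumptions, comparing a model tuned on `round_trip` data with one tuned on `single_pass` data holds the presence of translation defects roughly constant and varies only whether the underlying content is language-native.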