Abstract: Multilingual large language models are designed, claimed, and expected to cater to speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may not perfectly align with this objective owing to a heavy reliance on translation, which cannot cover language-specific knowledge but can introduce translation defects. It remains unknown whether the nature of the instruction data has an impact on the model output; conversely, it is questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues using controlled native or translated data during the instruction tuning and evaluation stages. We show that native or generation benchmarks reveal a notable difference between native and translated instruction data especially when model performance is high, whereas other types of test sets cannot. The comparison between round-trip and single-pass translations reflects the importance of knowledge from language-native resources. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.

What problem does this paper attempt to address?

The paper examines how effective, and how limited, translated data is when used for instruction tuning and evaluation of multilingual large language models (LLMs). The authors hypothesise that current tuning and evaluation practices may not fully align with the goal of serving speakers of varied languages, because they rely heavily on translation, which cannot cover language-specific knowledge and can introduce translation defects. The paper explores the following questions:

1. **Does the nature of the instruction data affect the model output?** That is, does model performance differ between instruction tuning on native data and on translated data?
2. **Can a translated test set capture this difference?** If native and translated instruction data lead to different behaviour, does a translated test set reveal it?
3. **Do translation defects or the lack of language-specific knowledge cause the performance gap?** Round-trip translation separates the two factors so that their relative impact can be compared.
4. **When translated data must be used, what techniques can narrow the performance gap?** For example, does regularization help close the gap between native and translated data on structured tasks?

To answer these questions, the authors run a series of experiments with eight models of different scales and data distributions, evaluated on nine benchmarks of different natures, covering both translated and native test sets and both classification and generation tasks. The results show that on native or generation benchmarks, especially when model performance is strong, there is a notable performance gap between native and translated instruction data, whereas other types of test sets do not reveal it. Data translated round-trip also performs better than data translated in a single pass, which suggests that the lack of language-specific knowledge may be more harmful than translation defects. Finally, when translated data must be used, regularization can narrow the gap on structured tasks but has limited effect on generation tasks.
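
The round-trip design behind the third question can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Translator` signature, the function names, and the toy `tagging_translator` are assumptions made for illustration, and the paper's actual translation system and datasets are not specified here. The idea is that native versus round-trip data differ only in translation passes, while round-trip versus single-pass data differ in where the content originated.

```python
from typing import Callable, Dict, List

# Hypothetical signature: translate(text, src_lang, tgt_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def single_pass(english_data: List[str], lang: str, translate: Translator) -> List[str]:
    """English-origin instructions translated once into the target language.

    Such data can carry translation defects and has no language-native knowledge.
    """
    return [translate(text, "en", lang) for text in english_data]


def round_trip(native_data: List[str], lang: str, translate: Translator) -> List[str]:
    """Native instructions translated into English and back again.

    The content still originates from native resources, but it has passed
    through translation twice.
    """
    return [translate(translate(text, lang, "en"), "en", lang) for text in native_data]


def build_conditions(
    native_data: List[str],
    english_data: List[str],
    lang: str,
    translate: Translator,
) -> Dict[str, List[str]]:
    """Return the three instruction-data conditions to fine-tune on and compare."""
    return {
        "native": list(native_data),
        "round_trip": round_trip(native_data, lang, translate),
        "translated": single_pass(english_data, lang, translate),
    }


def tagging_translator(text: str, src: str, tgt: str) -> str:
    """Toy stand-in for a real MT system: it only records the translation path."""
    return f"[{src}->{tgt}] {text}"


if __name__ == "__main__":
    conditions = build_conditions(["hola"], ["hello"], "es", tagging_translator)
    for name, examples in conditions.items():
        print(name, examples)
```

In a real experiment, `tagging_translator` would be replaced by an actual machine translation call, and each of the three conditions would be used to instruction-tune a separate copy of the same base model before evaluation.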

Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?

Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?

How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM

Multilingual Instruction Tuning With Just a Pinch of Multilinguality

Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?

X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions

Maybe Only 0.5 Training Data Instruction Tuning

Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

Multilingual Pretraining and Instruction Tuning Improve Cross-Lingual Knowledge Alignment, But Only Shallowly

Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace

How Multilingual Are Large Language Models Fine-Tuned for Translation?

Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?

Demystifying Instruction Mixing for Fine-tuning Large Language Models

Respond in my Language: Mitigating Language Inconsistency in Response Generation based on Large Language Models

Stronger Models are NOT Stronger Teachers for Instruction Tuning

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes

Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions