LLaMA Beyond English: An Empirical Study on Language Capability Transfer

Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, Xuanjing Huang
2024-01-12
Abstract: In recent times, substantial advancements have been witnessed in large language models (LLMs), exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on English-dominant corpora, which limits their performance in non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a comprehensive evaluation of the model's response quality is conducted, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality. Furthermore, the experimental outcomes across the thirteen low-resource languages also exhibit similar trends. We anticipate that the conclusions revealed by the experiments will aid the community in developing non-English LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how to effectively transfer the language generation and instruction-following abilities of large language models (LLMs) such as LLaMA from English to non-English languages. Specifically, the research focuses on the following points:

1. **The impact of vocabulary expansion**
   - The research explores how expanding the vocabulary affects model performance during transfer. The experimental results show that, for incremental pre-training at the scale of billions of tokens, expanding the vocabulary may not be the best option.
2. **The training scale required for effective transfer**
   - The research analyzes how further pre-training at different scales (taking Chinese as an example) affects the model's knowledge level and response quality. The results show that pre-training data at the scale of billions of tokens is not sufficient to significantly improve the model's knowledge level, while instruction fine-tuning can significantly improve response quality with only tens of thousands of instruction examples.
3. **The impact of transfer training on the original English ability**
   - The research evaluates whether relying solely on a target-language corpus for transfer training damages the model's original English ability. The results show that training on the target-language corpus alone significantly weakens the model's English ability, but multilingual joint training can effectively alleviate this problem.
4. **The universality of multilingual transfer**
   - The research is also extended to 13 low-resource languages to verify whether the above conclusions apply to other non-English languages. The experimental results show that, as the amount of instruction fine-tuning data increases, response quality improves significantly for all of the low-resource languages, and the more frequently used of these languages perform better.

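To make the first point concrete, the following is a minimal sketch of what "expanding the vocabulary" involves mechanically: new target-language tokens are appended to the token-to-index map, and the embedding table grows by one row per new token. The function name `extend_vocabulary`, the toy data, and the choice to initialize new rows to the mean of the existing rows are all assumptions made for illustration; the summary above does not say how the paper initializes new embeddings.

```python
def extend_vocabulary(vocab, embeddings, new_tokens):
    """Return a copy of (vocab, embeddings) with unseen new_tokens appended.

    vocab:      dict mapping token -> row index into embeddings
    embeddings: list of equal-length lists of floats, one row per token
    New rows are set to the mean of the existing rows. This is an
    assumed initialization for illustration, not the paper's method.
    """
    dim = len(embeddings[0])
    mean_row = [
        sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)
    ]
    # Copy so the original vocabulary and embedding table are left untouched.
    vocab = dict(vocab)
    embeddings = [list(row) for row in embeddings]
    for tok in new_tokens:
        if tok not in vocab:  # tokens already in the vocabulary are skipped
            vocab[tok] = len(embeddings)
            embeddings.append(list(mean_row))
    return vocab, embeddings


# Toy English-only vocabulary with 2-dimensional embeddings (hypothetical data).
base_vocab = {"<unk>": 0, "hello": 1, "world": 2}
base_emb = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]

new_vocab, new_emb = extend_vocabulary(base_vocab, base_emb, ["你好", "世界", "hello"])
print(len(new_vocab), len(new_emb))
```

The freshly added rows carry no learned information about the new tokens, so they have to be trained during further pre-training. That is one plausible reading of why vocabulary expansion may not pay off when the additional pre-training budget is small.
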

Through these studies, the paper aims to provide guidance that helps the community build efficient non-English LLMs and transfer language abilities effectively at minimal cost.

### Main conclusions

- **Vocabulary expansion**: For incremental pre-training at the scale of billions of tokens, expanding the vocabulary may not be the optimal choice.
- **Training scale**: Large-scale pre-training data brings only limited improvement to the knowledge level, whereas instruction fine-tuning can significantly improve response quality with only a small amount of data.
- **Multilingual joint training**: To avoid the decline in original-language ability caused by training on a single-language corpus, multilingual joint training is a better option.
- **Applicability to low-resource languages**: The experimental results hold across a variety of low-resource languages, supporting the generality of the transfer method.

### Formula representation

The formulas involved in the paper include the loss functions for pre-training and instruction fine-tuning:

#### Pre-training loss function

\[ L_{\text{pretrain}} = \sum_{x \in D} \sum_{i=1}^{n} -\log p_\theta(x_i \mid x_1, \ldots, x_{i-1}) \]

#### Instruction fine-tuning loss function

\[ L_{\text{ins}} = -\log p_\theta(Y \mid I) \]

where:

- \(D\) represents a large-scale corpus,
- \(x = \{x_1, \ldots, x_n\}\) represents the input token sequence,
- \(I\) represents the task instruction,
- \(Y\) represents the desired response.

These formulas describe the optimization objectives of the model in the pre-training and instruction fine-tuning stages.
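
The two objectives can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the paper's training code: `uniform_model` is a hypothetical stand-in for \(p_\theta\) that assigns equal probability to every token, and \(-\log p_\theta(Y \mid I)\) is assumed to factor token by token in the usual autoregressive way, with the instruction tokens serving only as context and contributing no loss terms of their own.

```python
import math


def uniform_model(vocab):
    """Hypothetical stand-in for p_theta: every next token is equally likely."""
    prob = 1.0 / len(vocab)

    def next_token_dist(prefix):
        # A real model would condition on the prefix; this toy ignores it.
        return {tok: prob for tok in vocab}

    return next_token_dist


def sequence_nll(model, tokens, start=0):
    """Sum of -log p(x_i | x_1..x_{i-1}) over positions i >= start (0-based)."""
    total = 0.0
    for i in range(start, len(tokens)):
        total += -math.log(model(tokens[:i])[tokens[i]])
    return total


def pretrain_loss(model, corpus):
    """L_pretrain: negative log-likelihood summed over every sequence x in D."""
    return sum(sequence_nll(model, x) for x in corpus)


def instruction_loss(model, instruction, response):
    """L_ins = -log p(Y | I): only response tokens are scored.

    Assumes the standard token-level factorization of p(Y | I).
    """
    tokens = list(instruction) + list(response)
    return sequence_nll(model, tokens, start=len(instruction))


model = uniform_model(["a", "b", "c", "d"])
corpus = [["a", "b", "c"], ["d", "a"]]  # D: two toy sequences, 5 tokens in total
print(pretrain_loss(model, corpus))  # equals 5 * log(4) under the uniform model
print(instruction_loss(model, ["a", "b"], ["c", "d", "a"]))  # equals 3 * log(4)
```

The only structural difference between the two losses in this sketch is the `start` offset: pre-training scores every token in the corpus, while instruction fine-tuning scores only the tokens of the desired response \(Y\).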