Abstract: In recent times, substantial advancements have been witnessed in large language models (LLMs), exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on English-dominant corpora, which limits their performance in non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, we conduct a comprehensive evaluation of the model's response quality, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that performance comparable to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, in terms of both knowledge alignment and response quality. Furthermore, the experimental outcomes across the thirteen low-resource languages exhibit similar trends. We anticipate that the conclusions revealed by the experiments will aid the community in developing non-English LLMs.

What problem does this paper attempt to address?

The paper asks how to effectively transfer the language generation and instruction-following abilities of large language models (LLMs) such as LLaMA from English to non-English languages. Specifically, it examines four questions:

1. **The impact of vocabulary extension**: The study examines how extending the vocabulary affects model performance during transfer. The experiments show that, for incremental pretraining on a scale of billions, extending the vocabulary may not be the best option.
2. **The training scale required for effective transfer**: The study analyzes how further pretraining at different scales (taking Chinese as an example) affects the model's knowledge level and response quality. Pretraining data on a scale of billions is not sufficient to significantly improve the model's knowledge level, whereas instruction fine-tuning significantly improves response quality with only tens of thousands of instruction examples.
3. **The impact of transfer training on the original English ability**: The study evaluates whether transfer training on a target-language corpus alone damages the model's original English ability. Using the target-language corpus alone significantly weakens the model's English ability, but multilingual joint training effectively alleviates this problem.
4. **The universality of multilingual transfer**: The study is extended to 13 low-resource languages to check whether the conclusions above hold for other non-English languages. As the amount of instruction fine-tuning data increases, response quality improves significantly for all of these languages, and the more widely used low-resource languages perform better.

Through these studies, the paper aims to give the community guidance for building efficient non-English LLMs and for transferring language abilities at minimum cost.

### Main conclusions

- **Vocabulary extension**: For incremental pretraining on a scale of billions, extending the vocabulary may not be the optimal choice.
- **Training scale**: Further pretraining on a scale of billions yields only limited improvement in knowledge level, while instruction fine-tuning significantly improves response quality with only a small amount of data.
- **Multilingual joint training**: Multilingual joint training is a better option than target-language-only training, because it avoids the decline in the original English ability.
- **Applicability to low-resource languages**: The findings carry over to a variety of low-resource languages, which supports the generality of the transfer method.

### Formula representation

The formulas in the paper include the loss functions for pretraining and instruction fine-tuning:

#### Pretraining loss function

\[ L_{\text{pretrain}} = \sum_{x \in D} \sum_{i} -\log p_\theta(x_i \mid x_1, \ldots, x_{i-1}) \]

#### Instruction fine-tuning loss function

\[ L_{\text{ins}} = -\log p_\theta(Y \mid I) \]

where:

- \(D\) is a large-scale corpus,
- \(x = \{x_1, \ldots, x_n\}\) is an input token sequence,
- \(I\) is the task instruction,
- \(Y\) is the desired response.

These formulas describe the model's optimization objectives in the pretraining and instruction fine-tuning stages.
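
To make the two objectives concrete, here is a minimal Python sketch, not the paper's implementation. It assumes the model's per-token probabilities are already available as plain lists of floats; the function names (`sequence_nll`, `pretrain_loss`, `instruction_loss`) and the `num_instruction_tokens` parameter are hypothetical and introduced only for illustration. By the chain rule, \(-\log p_\theta(Y \mid I)\) is the sum of the negative log-probabilities of the response tokens only, so the sketch masks out the instruction positions.

```python
import math
from typing import Optional, Sequence


def sequence_nll(token_probs: Sequence[float],
                 loss_mask: Optional[Sequence[bool]] = None) -> float:
    """Sum of -log p over the positions where loss_mask is True.

    token_probs[i] stands for p_theta(x_i | x_1, ..., x_{i-1}), i.e. the
    probability the model assigned to the token actually observed at position i.
    """
    if loss_mask is None:
        loss_mask = [True] * len(token_probs)
    if len(loss_mask) != len(token_probs):
        raise ValueError("loss_mask and token_probs must have the same length")
    return sum(-math.log(p) for p, keep in zip(token_probs, loss_mask) if keep)


def pretrain_loss(corpus_probs: Sequence[Sequence[float]]) -> float:
    """L_pretrain: every token of every sequence x in the corpus D contributes."""
    return sum(sequence_nll(seq) for seq in corpus_probs)


def instruction_loss(token_probs: Sequence[float],
                     num_instruction_tokens: int) -> float:
    """L_ins = -log p_theta(Y | I).

    Assumes the sequence is the instruction I followed by the response Y.
    Only the response positions are scored; the instruction is conditioning
    context and contributes nothing to the loss.
    """
    mask = [i >= num_instruction_tokens for i in range(len(token_probs))]
    return sequence_nll(token_probs, mask)


if __name__ == "__main__":
    # Toy numbers, invented for illustration only.
    corpus = [[0.5, 0.5], [0.25]]
    print("pretraining loss:", pretrain_loss(corpus))

    # First two positions are the instruction, last two are the response.
    example = [0.5, 0.25, 0.5, 0.5]
    print("instruction loss:", instruction_loss(example, num_instruction_tokens=2))
```

In a real training loop the same masking is usually applied to the label tensor rather than to a list of probabilities, but the quantity being minimized is the one shown here.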

LLaMA Beyond English: An Empirical Study on Language Capability Transfer
