Abstract: Despite rapid progress in large language models (LLMs), their performance on a vast majority of languages remains unsatisfactory. In this paper, we study building language-specific LLMs by adapting monolingual and multilingual LLMs. We conduct systematic experiments on how design choices (base model selection, vocabulary extension, and continued pretraining) impact the adapted LLM, both in terms of efficiency (how many tokens are needed to encode the same amount of information) and end task performance. We find that (1) the initial performance of an LLM does not always correlate with the final performance after adaptation: adapting an English-centric model can yield better results than adapting multilingual models, despite its worse initial performance on low-resource languages; (2) efficiency can easily be improved with simple vocabulary extension and continued pretraining in most LLMs we study; and (3) the optimal adaptation method (choice of the base model, new vocabulary size, training data, initialization strategy) is highly language-dependent, and the simplest embedding initialization works well across various experimental settings. Together, our work lays foundations for efficiently building language-specific LLMs by adapting existing LLMs.
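The efficiency notion in the abstract (how many tokens are needed to encode the same amount of information) can be illustrated with a minimal sketch. It assumes efficiency is measured as tokens per whitespace-separated word, which is one common way to compute fertility; the paper's exact measurement may differ. The function name `fertility` and both toy tokenizers are hypothetical stand-ins, not part of the paper.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word.

    Assumption: words are split on whitespace. Lower values mean the
    tokenizer encodes the same text with fewer tokens.
    """
    n_tokens = sum(len(tokenize(text)) for text in texts)
    n_words = sum(len(text.split()) for text in texts)
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words


# Hypothetical toy tokenizers standing in for a real subword tokenizer.
def char_tokenize(text: str) -> List[str]:
    """One token per non-whitespace character (worst case)."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text: str) -> List[str]:
    """One token per word (best case)."""
    return text.split()


if __name__ == "__main__":
    sample = ["ab cd"]
    print(fertility(sample, char_tokenize))  # 4 tokens / 2 words
    print(fertility(sample, word_tokenize))  # 2 tokens / 2 words
```

Comparing the same text under a base tokenizer and an extended tokenizer with this kind of ratio is one way to quantify how much vocabulary extension narrows the gap between English and a target language.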

What problem does this paper attempt to address?

The problem this paper addresses is that, despite the rapid progress of large language models (LLMs), their performance on most languages is still unsatisfactory. Low-resource languages in particular are often handled poorly. The paper studies how to build language-specific LLMs by adapting monolingual and multilingual LLMs, and systematically explores how design choices (base model selection, vocabulary extension, continued pretraining) affect the adapted LLM, both in efficiency (the number of tokens required to encode the same information) and in end-task performance.

### Main problems:

1. **The relationship between initial performance and final performance**:
   - The initial performance of an LLM before adaptation does not always predict its final performance. For example, adapting an English-centric model (such as LLaMA-2) may work better than adapting a multilingual model, even though the English-centric model starts out worse on the low-resource language.
2. **Efficiency improvement**:
   - Simple vocabulary extension and continued pretraining significantly improve the efficiency of most LLMs. Specifically, a moderate vocabulary extension (such as 10,000 new tokens) is sufficient to narrow the efficiency gap between English and low-resource languages.
3. **The choice of the optimal adaptation method**:
   - The optimal adaptation method (choice of base model, new vocabulary size, training data, initialization strategy) is highly dependent on the target language. Experimental results show that the simplest embedding initialization method performs well across experimental settings.

### Specific research contents:

- **Vocabulary extension**: Adding target-language tokens to the vocabulary so that the model encodes target-language text with fewer tokens.
- **Continued pretraining**: Continuing pretraining on target-language data to improve end-task performance.
- **Base model selection**: Studying how different base models (such as XGLM-7.5B, Gemma-7B, Bloom-7.1B, LLaMA-2-7B) affect the adaptation outcome.

### Experimental settings:

- **Target languages**: Hindi, Turkish, Arabic, and Tamil.
- **Evaluation tasks**: Machine translation, title generation, knowledge probing, natural language inference, sentiment analysis, causal and common-sense reasoning, among others.
- **Evaluation metrics**: Efficiency metrics (such as fertility, that is, the average number of tokens required to encode the same information) and task-performance metrics (such as spBLEU and accuracy).

### Main findings:

- **The relationship between initial performance and final performance**: Performance before adaptation does not always predict final performance. Adapting an English-centric model can achieve results comparable to, or even better than, those of adapting a multilingual model.
- **The effectiveness of vocabulary extension**: A moderate vocabulary extension (10,000 tokens) significantly improves efficiency by reducing the number of tokens required to encode the same information.
- **The importance of embedding initialization**: A simple initialization method (such as mean initialization of the new token embeddings) performs comparably to more complex methods.
- **The specificity of languages and base models**: Adaptation performance is highly dependent on the target language and on the pretraining corpus of the base model.

In conclusion, this paper provides a basis for efficiently building language-specific LLMs and, through systematic experiments, offers guidance for adapting existing LLMs.
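The simple embedding initialization mentioned in the findings can be sketched as follows. This is a minimal illustration under a stated assumption: that mean initialization sets each new token's embedding to the average of the embeddings of the pieces the original tokenizer splits it into. The summary above does not spell out the exact variant, so treat this as one plausible reading rather than the paper's implementation. The names `mean_init`, `old_tokenize`, and `old_vocab`, and the toy character-level vocabulary, are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def mean_init(
    new_tokens: List[str],
    old_tokenize: Callable[[str], List[str]],
    old_vocab: Dict[str, int],
    embeddings: List[Vector],
) -> Tuple[List[Vector], Dict[str, int]]:
    """Append one embedding row per new token.

    Assumption: each new row is the mean of the original embeddings of
    the pieces that the original tokenizer produces for that token.
    Raises KeyError if a piece is missing from old_vocab.
    """
    dim = len(embeddings[0])
    extended = [row[:] for row in embeddings]  # copy; leave the input untouched
    new_ids: Dict[str, int] = {}
    for token in new_tokens:
        piece_ids = [old_vocab[piece] for piece in old_tokenize(token)]
        if not piece_ids:
            raise ValueError(f"original tokenizer produced no pieces for {token!r}")
        mean = [
            sum(embeddings[i][d] for i in piece_ids) / len(piece_ids)
            for d in range(dim)
        ]
        new_ids[token] = len(extended)
        extended.append(mean)
    return extended, new_ids


if __name__ == "__main__":
    # Toy setup: a character-level "original" tokenizer with two known pieces.
    old_vocab = {"a": 0, "b": 1}
    embeddings = [[1.0, 0.0], [0.0, 1.0]]
    extended, new_ids = mean_init(["ab"], list, old_vocab, embeddings)
    print(new_ids)       # {'ab': 2}
    print(extended[2])   # [0.5, 0.5]
```

In a real model the same averaging would be applied to the input embedding matrix (and usually the output projection) after resizing it for the extended vocabulary, followed by continued pretraining on target-language data.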

Exploring Design Choices for Building Language-Specific LLMs

A Three-Pronged Approach to Cross-Lingual Adaptation with Multilingual LLMs

Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners

SambaLingo: Teaching Large Language Models New Languages

Efficiently Adapting Pretrained Language Models To New Languages

Bilingual Adaptation of Monolingual Foundation Models

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments

Improving Bilingual Capabilities of Language Models to Support Diverse Linguistic Practices in Education

Bridging the Gap: Dynamic Learning Strategies for Improving Multilingual Performance in LLMs

From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages

How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

How do Large Language Models Handle Multilingualism?

Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?

Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

Exploring Continual Fine-Tuning for Enhancing Language Ability in Large Language Model

Adapting Large Language Models for Document-Level Machine Translation

Optimizing Low-Resource Language Model Training: Comprehensive Analysis of Multi-Epoch, Multi-Lingual, and Two-Stage Approaches

Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation

When Life gives you LLMs, make LLM-ADE: Large Language Models with Adaptive Data Engineering

Could We Have Had Better Multilingual LLMs If English Was Not the Central Language?