Abstract: Despite rapid progress in large language models (LLMs), their performance on the vast majority of languages remains unsatisfactory. In this paper, we study building language-specific LLMs by adapting monolingual and multilingual LLMs. We conduct systematic experiments on how design choices (base model selection, vocabulary extension, and continued pretraining) impact the adapted LLM, both in terms of efficiency (how many tokens are needed to encode the same amount of information) and end task performance. We find that (1) the initial performance of an LLM does not always correlate with its final performance after adaptation: adapting an English-centric model can yield better results than adapting a multilingual model, despite its worse initial performance on low-resource languages; (2) efficiency can be easily improved with simple vocabulary extension and continued pretraining in most LLMs we study; and (3) the optimal adaptation method (choice of the base model, new vocabulary size, training data, initialization strategy) is highly language-dependent, and the simplest embedding initialization works well across various experimental settings. Together, our work lays the foundations for efficiently building language-specific LLMs by adapting existing LLMs.
What problem does this paper attempt to address?
The paper addresses the following problem: despite rapid progress in large language models (LLMs), their performance on most languages remains unsatisfactory, and it is often especially poor for low-resource languages. The paper studies how to build language-specific LLMs by adapting monolingual and multilingual LLMs, and systematically examines how design choices (base model selection, vocabulary extension, and continued pretraining) affect the adapted LLM in terms of efficiency (the number of tokens needed to encode the same amount of information) and end-task performance.
### Main problems:
1. **The relationship between initial performance and final performance**:
- The study finds that an LLM's performance before adaptation does not always predict its performance after adaptation. For example, adapting an English-centric model (such as LLaMA-2) can yield better results than adapting a multilingual model, even though the English-centric model initially performs worse on low-resource languages.
2. **Efficiency improvement**:
- Simple vocabulary extension combined with continued pretraining significantly improves efficiency in most of the LLMs studied. Specifically, a moderate extension (about 10,000 new vocabulary entries) is sufficient to narrow the efficiency gap between English and the low-resource target languages.
3. **The choice of the optimal adaptation method**:
- The optimal adaptation method (choice of base model, new vocabulary size, training data, initialization strategy) depends heavily on the target language. Even so, the simplest embedding initialization method performs well across the experimental settings.
### Specific research contents:
- **Vocabulary extension**: Add target-language entries to the model's vocabulary so that the same text is encoded with fewer tokens.
- **Continued pretraining**: Continue pretraining on target-language data to improve the model's end-task performance.
- **Base model selection**: Study how the choice of base model (such as XGLM-7.5B, Gemma-7B, Bloom-7.1B, LLaMA-2-7B) affects the outcome of adaptation.
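
To make the vocabulary extension step concrete, here is a minimal sketch of adding new tokens and initializing their embeddings by averaging. It assumes one common variant of mean initialization, in which each new token's embedding is the mean of the embeddings of the pieces the base vocabulary splits it into; whether this matches the paper's exact setup is an assumption. The names `greedy_segment` and `extend_vocab_mean_init` are hypothetical, and the toy longest-match segmenter only stands in for a real tokenizer.

```python
from typing import Dict, List, Tuple


def greedy_segment(text: str, vocab: Dict[str, int]) -> List[str]:
    """Toy longest-match segmentation (a stand-in for the base model's tokenizer)."""
    pieces: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {text!r} at position {i}")
    return pieces


def extend_vocab_mean_init(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    new_tokens: List[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Return extended copies of vocab and embeddings; inputs are not modified.

    Each new row is the element-wise mean of the embeddings of the pieces
    that the base vocabulary splits the new token into (an assumed variant
    of mean initialization).
    """
    new_vocab = dict(vocab)
    new_embeddings = [row[:] for row in embeddings]
    for token in new_tokens:
        if token in new_vocab:
            continue  # already present, nothing to add
        pieces = greedy_segment(token, vocab)
        rows = [embeddings[vocab[p]] for p in pieces]
        mean_row = [sum(col) / len(rows) for col in zip(*rows)]
        new_vocab[token] = len(new_embeddings)
        new_embeddings.append(mean_row)
    return new_vocab, new_embeddings


# Toy data: three base tokens with 2-dimensional embeddings.
base_vocab = {"a": 0, "b": 1, "ab": 2}
base_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
vocab, emb = extend_vocab_mean_init(base_vocab, base_emb, ["aba"])
print(vocab["aba"], emb[vocab["aba"]])
```

In a real setup the embedding matrix would be resized inside the model and the new rows would then be refined during continued pretraining; the sketch only covers the initialization.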
### Experimental settings:
- **Target languages**: Hindi, Turkish, Arabic and Tamil.
- **Evaluation tasks**: Machine translation, title generation, knowledge probing, natural language inference, sentiment analysis, and causal/commonsense reasoning, among others.
- **Evaluation metrics**: Efficiency metrics (such as fertility, that is, the average number of tokens required to encode the same information) and task performance metrics (such as spBLEU and accuracy).
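
The fertility metric can be sketched as follows. This is a minimal illustration that assumes the common definition of fertility as the average number of tokens per whitespace-separated word; the paper's exact normalization may differ. The two tokenizers are toy stand-ins (hypothetical names), chosen only to show how a vocabulary that covers the target language lowers fertility.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def char_tokenize(text: str) -> List[str]:
    """Toy stand-in for a tokenizer with poor coverage: one token per character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text: str) -> List[str]:
    """Toy stand-in for a tokenizer whose vocabulary contains every word."""
    return text.split()


corpus = ["merhaba dunya"]
print(fertility(corpus, char_tokenize))  # 12 tokens / 2 words
print(fertility(corpus, word_tokenize))  # 2 tokens / 2 words
```

Lower fertility means fewer tokens for the same text, which translates into shorter sequences at both training and inference time.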
### Main findings:
- **The relationship between initial performance and final performance**: Performance before adaptation does not always predict final performance. Adapting an English-centric model can achieve results comparable to, or even better than, those of adapting a multilingual model.
- **The effectiveness of vocabulary extension**: A moderate vocabulary extension (about 10,000 new entries) can significantly improve efficiency by reducing the number of tokens required to encode the same information.
- **The importance of embedding initialization**: A simple initialization method (such as a mean of existing embeddings for each new token) performs comparably to more complex methods.
- **The specificity of languages and base models**: Adaptation performance depends heavily on the target language and on the pretraining corpus of the base model.
In conclusion, the paper lays a foundation for efficiently building language-specific LLMs and, through systematic experiments, offers guidance on adapting existing LLMs.