LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language

Cagri Toraman
DOI: https://doi.org/10.48550/arXiv.2405.07745
2024-05-13
Abstract: Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks. However, extending the vocabulary shows no substantial benefits. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to improve the performance of generative large language models (LLMs) on low-resource languages. Although English-dominant generative LLMs have advanced significantly, low-resource languages need further development to improve global accessibility. The paper explores an alternative to monolingual and multilingual pretraining: adapting LLMs trained primarily on English to low-resource languages through continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. Specifically, it evaluates the effect of these strategies on language comprehension (as reflected in perplexity scores) and on downstream task performance, and it examines how model size and multilingual base models affect adaptation.
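Since perplexity is the measure of language comprehension named above, a minimal sketch of how it is conventionally computed may help: perplexity is the exponential of the average negative log-likelihood per token. The function name `perplexity` and the sample probabilities are illustrative assumptions; the paper's own evaluation setup (tokenizer, corpus, context length) is not described here.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood over tokens.

    token_log_probs: natural-log probabilities the model assigned
    to each observed token (hypothetical values in this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Illustrative inputs only, not results from the paper.
before_adaptation = [math.log(p) for p in (0.05, 0.10, 0.02, 0.08)]
after_adaptation = [math.log(p) for p in (0.30, 0.40, 0.25, 0.35)]

print(perplexity(before_adaptation))
print(perplexity(after_adaptation))
```

Lower perplexity means the model assigns higher probability to the observed text, which is the sense in which continual training is reported to improve comprehension. A model that spreads probability uniformly over four choices at every step would score approximately 4.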