Small Languages, Big Models: A Study of Continual Training on Languages of Norway

David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov
2024-12-09
Abstract: Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Sámi. To address this issue, we present a novel three-stage continual training approach. We also experiment with combining causal and masked language modeling to get more flexible models. Based on our findings, we train, evaluate, and openly release a new large generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
Computation and Language