Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding

Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan
2024-02-04
Abstract: In this paper, we present significant advancements in the pretraining of Mistral 7B, a large-scale language model, using a dataset of 32.6 GB, equivalent to 1.1 billion tokens. We explore the impact of extending the context length, releasing models with context lengths of 4096 and 32768 tokens, and further refine performance with a specialized instruction-tuned model with a 16384-token context length, which we call Malaysian Mistral.
Computation and Language