Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung
2024-07-08
Abstract: Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering them ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining 20% improvement, and demonstrate its capability to generalize to unseen tasks and indigenous languages of Indonesia. Furthermore, Cendol models showcase improved human favorability despite their limitations in capturing indigenous knowledge and cultural values in Indonesia. In addition, we discuss the shortcomings of parameter-efficient tunings, such as LoRA, for language adaptation. Alternatively, we propose the usage of vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and showcase that safety in pre-training in one language such as English is transferable to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the weak performance of large language models (LLMs) on Indonesian and the indigenous languages of Indonesia. Existing LLMs show a quality gap on these low-resource languages, which makes them less effective and less efficient there and raises potential safety issues. To close the gap, the authors introduce Cendol, a collection of LLMs adapted to Indonesian and its local languages. The collection covers both decoder-only and encoder-decoder architectures, with parameter counts ranging from 300 million to 13 billion. The evaluation reports a 20% improvement across a diverse set of tasks, together with generalization to unseen tasks and to unseen indigenous languages. The paper also discusses the shortcomings of parameter-efficient tuning methods such as LoRA for language adaptation and proposes vocabulary adaptation as a way to improve efficiency. Finally, it examines whether safety acquired during pre-training in one language, such as English, transfers to a low-resource language such as Indonesian, and reports that it does even without RLHF or safety fine-tuning. In short, the work aims to improve how well LLMs represent and serve Indonesian and the other languages of Indonesia.
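For readers unfamiliar with LoRA, the following minimal sketch shows how few parameters a low-rank update trains compared with updating a full weight matrix. It relies only on the standard LoRA formulation (a frozen weight W of shape d_out x d_in plus a trainable product B·A, with B of shape d_out x r and A of shape r x d_in). The hidden size and rank below are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in),
    so the total is rank * (d_out + d_in).
    """
    return rank * (d_out + d_in)


# Illustrative values only (assumptions, not from the paper).
hidden_size = 4096
rank = 8

full = full_finetune_params(hidden_size, hidden_size)
lora = lora_params(hidden_size, hidden_size, rank)

print(f"full fine-tuning: {full} trainable parameters per matrix")
print(f"LoRA (rank {rank}): {lora} trainable parameters per matrix")
print(f"fraction trained: {lora / full:.4%}")
```

The sketch only illustrates the size of the trainable update. It does not reproduce the paper's analysis of why such a small update may fall short for language adaptation.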