Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

2024-07-08

Abstract: Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering them ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining 20% improvement, and demonstrate its capability to generalize to unseen tasks and indigenous languages of Indonesia. Furthermore, Cendol models showcase improved human favorability despite their limitations in capturing indigenous knowledge and cultural values in Indonesia. In addition, we discuss the shortcomings of parameter-efficient tunings, such as LoRA, for language adaptation. Alternatively, we propose the usage of vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and showcase that safety in pre-training in one language such as English is transferable to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the poor performance of large language models (LLMs) on Indonesian and its local languages. Existing LLMs exhibit a quality gap on Indonesian and other low-resource languages, which makes them less effective, less efficient, and potentially less safe in those settings. To close this gap, the authors developed Cendol, a collection of LLMs adapted to Indonesian and its local languages. The Cendol models cover both decoder-only and encoder-decoder architectures, with parameter counts ranging from 300 million to 13 billion. Through extensive evaluation, the paper shows that Cendol improves results across a range of tasks, discusses the limitations of parameter-efficient tuning methods such as LoRA for language adaptation, and proposes vocabulary adaptation as a way to improve efficiency. The paper also examines whether safety acquired during pre-training in one language, such as English, transfers to low-resource languages like Indonesian. In summary, the work aims to improve how well LLMs represent and serve Indonesian and its local languages.

IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

Komodo: A Linguistic Expedition into Indonesia's Regional Languages

Performance of Recent Large Language Models for a Low-Resourced Language

Efficient Continual Pre-training of LLMs for Low-resource Languages

Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU

Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts

Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia

Do Large Language Models Speak All Languages Equally? A Comparative Study in Low-Resource Settings

Compass: Large Multilingual Language Model for South-east Asia

Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions

SambaLingo: Teaching Large Language Models New Languages

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

LLM for Everyone: Representing the Underrepresented in Large Language Models

End-to-end Indonesian speech recognition with convolutional and gated recurrent units

Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese

NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models