SeaLLMs -- Large Language Models for Southeast Asia

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, Lidong Bing
2024-07-01
Abstract: Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the uneven coverage of Southeast Asian (SEA) languages in large language models (LLMs). Existing LLMs perform well on many tasks but favor high-resource languages such as English, and they perform poorly on low-resource and regional languages. To narrow this gap, the authors introduce SeaLLMs, a series of models built on Llama-2 and focused on SEA languages. The models are adapted through continued pre-training with an extended vocabulary, followed by specialized instruction tuning and alignment tuning, so that they better capture the intricacies of regional languages and reflect local cultural norms, customs, stylistic preferences, and legal considerations. According to the authors' evaluation, the SeaLLM-13b models outperform comparable open-source models on a wide range of linguistic tasks and on assistant-style instruction following. They also outperform ChatGPT-3.5 by large margins in non-Latin-script languages such as Thai, Khmer, Lao, and Burmese, while remaining lightweight and cost-effective to operate. In this way, SeaLLMs help narrow the gap in language coverage and widen access to current AI technology for non-English speakers.
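A toy example may help show why an extended vocabulary matters for non-Latin scripts. The sketch below is not the SeaLLMs tokenizer. It assumes a hypothetical greedy longest-match tokenizer that falls back to one token per UTF-8 byte when no vocabulary entry matches (Llama-2's tokenizer also uses byte fallback for unknown characters). The names `tokenize`, `base_vocab`, and `extended_vocab` are illustrative only.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with UTF-8 byte fallback (toy model)."""
    tokens = []
    i = 0
    max_len = max((len(t) for t in vocab), default=1)
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No entry matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical English-centric vocabulary with no Thai entries.
base_vocab = {"hello", " ", "world"}
thai = "สวัสดี"  # a Thai greeting

# Assumed extension: the Thai word is added as a single vocabulary entry.
extended_vocab = base_vocab | {thai}

before = tokenize(thai, base_vocab)
after = tokenize(thai, extended_vocab)
print(len(before), "tokens before extension;", len(after), "after")
```

Each Thai character occupies three bytes in UTF-8, so under the base vocabulary the word costs one token per byte, while the extended vocabulary covers it in a single token. Fewer tokens per word means shorter sequences for the same text. In a real model, the new entries would also need embeddings, which is presumably one reason the vocabulary extension is paired with continued pre-training.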