Abstract: Typhoon is a series of Thai large language models (LLMs) developed specifically for the Thai language. This technical report presents challenges and insights in developing Thai LLMs, including data preparation, pretraining, instruction-tuning, and evaluation. As one of the challenges of low-resource languages is the amount of pretraining data, we apply continual training to transfer existing world knowledge from a strong LLM. To evaluate the Thai knowledge encapsulated in each model from the pretraining stage, we develop ThaiExam, a benchmark based on examinations for high-school students and investment professionals in Thailand. In addition, we fine-tune Typhoon to follow Thai instructions, and we evaluate instruction-tuned models on Thai instruction datasets as well as translation, summarization, and question-answering tasks. Experimental results on a suite of Thai benchmarks show that Typhoon outperforms all open-source Thai language models, and its performance is on par with GPT-3.5 in Thai while having only 7 billion parameters and being 2.62 times more efficient in tokenizing Thai text.
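To make the tokenization-efficiency claim concrete, here is a minimal sketch of how such a ratio could be computed, assuming "more efficient" means producing fewer tokens for the same text. Both tokenizers below are hypothetical stand-ins invented for illustration (a byte-level fallback and a one-token-per-character scheme); neither is Typhoon's or GPT-3.5's actual tokenizer, and the resulting number is not the paper's.

```python
from typing import Callable, Sequence


def byte_tokens(text: str) -> list[int]:
    """Hypothetical baseline: one token per UTF-8 byte (byte-level fallback)."""
    return list(text.encode("utf-8"))


def char_tokens(text: str) -> list[str]:
    """Hypothetical Thai-aware stand-in: one token per Unicode character."""
    return list(text)


def efficiency_ratio(
    text: str,
    baseline: Callable[[str], Sequence],
    candidate: Callable[[str], Sequence],
) -> float:
    """Return how many times fewer tokens the candidate needs than the baseline."""
    return len(baseline(text)) / len(candidate(text))


# Thai characters occupy three bytes each in UTF-8, so the byte-level
# baseline needs three times as many tokens on pure Thai text.
sample = "สวัสดี"
ratio = efficiency_ratio(sample, byte_tokens, char_tokens)
print(f"candidate is {ratio:.2f}x more token-efficient on this sample")
```

In practice the two callables would wrap real tokenizers and the ratio would be averaged over a representative Thai corpus rather than a single word.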

What problem does this paper attempt to address?

The paper addresses the problem of building large language models (LLMs) that perform well on Thai-language tasks. Specifically, it focuses on four aspects:

1. **Data Preparation and Training**: Because Thai is a low-resource language, the data available for pre-training is limited in quantity and uneven in quality. The paper explores transferring world knowledge into a Thai model by continuing the training of an existing strong LLM.
2. **Evaluating Thai Knowledge**: To assess the Thai knowledge a model acquires during pre-training, the researchers developed a benchmark called "ThaiExam," based on examinations for high-school students and investment professionals in Thailand. Instruction-tuned models are additionally evaluated on Thai instruction datasets and on translation, summarization, and question-answering tasks.
3. **Instruction Tuning**: The paper investigates how to make the model follow Thai instructions more reliably. This includes translating English instruction-tuning datasets into Thai and using automatically generated Thai instruction datasets for supervised fine-tuning.
4. **Performance Comparison**: The paper compares Typhoon with other open-source and proprietary language models across multiple Thai benchmarks, showing that Typhoon achieves state-of-the-art performance on several tasks, especially instruction following and standard natural language processing tasks.

Overall, the paper aims to improve the performance of Thai large language models through continual training, Thai-specific evaluation, and instruction tuning, so that they better serve the Thai community.

Typhoon: Thai Large Language Models

Thai Financial Domain Adaptation of THaLLE -- Technical Report

Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models

Eir: Thai Medical Large Language Models

Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

OpenThaiGPT 1.5: A Thai-Centric Open Source Large Language Model

Can General-Purpose Large Language Models Generalize to English-Thai Machine Translation ?

Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models

Xmodel-1.5: An 1B-scale Multilingual LLM

WangchanBERTa: Pretraining transformer-based Thai Language Models

Thai Wav2Vec2.0 with CommonVoice V8

ThaiNutriChat: development of a Thai large language model-based chatbot for health food services

Efficient Finetuning Large Language Models For Vietnamese Chatbot

THaLLE: Text Hyperlocally Augmented Large Language Extension -- Technical Report

Performance of Recent Large Language Models for a Low-Resourced Language

LaoPLM: Pre-trained Language Models for Lao

Tele-LLMs: A Series of Specialized Large Language Models for Telecommunications

Typhoon: Towards an Effective Task-Specific Masking Strategy for Pre-trained Language Models

PolyLM: An Open Source Polyglot Large Language Model

An analysis of large language models: their impact and potential applications