Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai

Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, Peerat Limkonchotiwat
2024-11-23
Abstract: We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at https://github.com/parinzee/seed-free-synthetic-instruct.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to build a high-quality instruction-tuning dataset for a low-resource language (here, Thai) in a data-efficient way, so that large language models (LLMs) perform better in that language. Specifically, it aims to reduce dependence on large amounts of labeled data by using a synthetic data generation framework, while maintaining or improving model performance on specific tasks. The proposed framework requires no seed data and generates a synthetic instruction-tuning dataset that incorporates three key properties: fluency, diversity, and cultural context. Experimental results show that a synthetic dataset of only 5,000 instructions achieves performance competitive with existing state-of-the-art Thai LLMs, which are trained on hundreds of thousands of instructions. This substantially reduces data requirements and the associated costs, and offers a more efficient way to improve LLM performance in low-resource languages.
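
The three stages named in the abstract (topic generation, context retrieval from Wikipedia, and instruction creation per task) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, prompt wording, the one-instruction-per-task-per-topic loop, and the stubbed LLM and retriever are all assumptions made for illustration, and the two context strings are placeholder data standing in for retrieved Wikipedia passages.

```python
from dataclasses import dataclass
from typing import Callable, List

# Task types listed in the abstract.
TASKS = ("question answering", "summarization", "conversation")


@dataclass
class Instruction:
    topic: str
    task: str
    context: str
    prompt: str


def generate_dataset(
    llm: Callable[[str], str],        # hypothetical text-in, text-out LLM call
    retrieve: Callable[[str], str],   # hypothetical Wikipedia context lookup
    n_topics: int,
) -> List[Instruction]:
    # Stage 1: ask the LLM for topics, so no seed data is needed.
    raw = llm(f"List {n_topics} diverse topics about Thai culture, one per line.")
    topics = [t.strip() for t in raw.splitlines() if t.strip()][:n_topics]

    dataset: List[Instruction] = []
    for topic in topics:
        # Stage 2: ground each topic in a retrieved context.
        context = retrieve(topic)
        if not context:
            continue  # skip topics with no retrievable context
        # Stage 3: create one instruction per task type (an assumption).
        for task in TASKS:
            prompt = llm(
                f"Write a fluent Thai {task} instruction about '{topic}'.\n"
                f"Context: {context}"
            )
            dataset.append(Instruction(topic, task, context, prompt))
    return dataset


# Stubs so the sketch runs offline; a real setup would call an LLM API
# and a Wikipedia retriever instead.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("List"):
        return "Songkran\nMuay Thai\n"
    return "INSTRUCTION: " + prompt.splitlines()[0]


PLACEHOLDER_WIKI = {
    "Songkran": "Songkran is the Thai New Year festival.",
    "Muay Thai": "Muay Thai is a Thai martial art.",
}


def fake_retrieve(topic: str) -> str:
    return PLACEHOLDER_WIKI.get(topic, "")


data = generate_dataset(fake_llm, fake_retrieve, n_topics=2)
for item in data:
    print(item.topic, "|", item.task)
```

Passing the LLM and retriever in as callables keeps the pipeline logic separate from any particular model or search backend, so either stage can be swapped without touching the loop.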