Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai

Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, Peerat Limkonchotiwat

2024-11-23

Abstract: We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at https://github.com/parinzee/seed-free-synthetic-instruct.

Computation and Language

What problem does this paper attempt to address?

This paper addresses how to build a high-quality instruction-tuning dataset for a low-resource language (here, Thai) in a data-efficient way, so that large language models (LLMs) perform better in that language. Specifically, it aims to reduce dependence on large amounts of labeled data through a synthetic data generation framework while maintaining or improving model performance on specific tasks. The proposed framework requires no seed data and generates a synthetic instruction-tuning dataset with three key properties: fluency, diversity, and cultural context. Experimental results show that a synthetic dataset of only 5,000 instructions achieves performance competitive with existing state-of-the-art Thai LLMs, which are trained on hundreds of thousands of instructions. This substantially reduces data requirements and the associated costs, and offers a more efficient way to improve LLM performance in low-resource languages.
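
The pipeline described in the abstract (generate topics, retrieve context, create task instructions) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is available at the repository linked above; every name here (`generate_topics`, `retrieve_context`, `build_dataset`, `stub_llm`, the prompts, and the toy corpus) is a hypothetical stand-in, the LLM is replaced by a deterministic stub, and Wikipedia retrieval is replaced by a dictionary lookup.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Task types named in the abstract; the round-robin assignment below is an assumption.
TASKS = ["question answering", "summarization", "conversation"]


@dataclass
class Example:
    topic: str
    task: str
    context: str
    instruction: str


def generate_topics(llm: Callable[[str], str], n: int) -> List[str]:
    """Ask the LLM for topics (one per line) and drop blanks and duplicates."""
    reply = llm(f"List {n} diverse topics about Thai culture, one per line.")
    seen, topics = set(), []
    for line in reply.splitlines():
        topic = line.strip()
        if topic and topic not in seen:
            seen.add(topic)
            topics.append(topic)
    return topics[:n]


def retrieve_context(topic: str, corpus: Dict[str, str]) -> str:
    """Toy retriever: first article whose title contains the topic, else ''.

    A real system would query a Wikipedia index; this lookup is a stand-in.
    """
    for title, text in corpus.items():
        if topic.lower() in title.lower():
            return text
    return ""


def build_dataset(
    llm: Callable[[str], str], corpus: Dict[str, str], n_topics: int
) -> List[Example]:
    """Topic -> context -> instruction, skipping topics with no context."""
    examples = []
    for i, topic in enumerate(generate_topics(llm, n_topics)):
        context = retrieve_context(topic, corpus)
        if not context:
            continue
        task = TASKS[i % len(TASKS)]
        prompt = f"Write a {task} instruction in Thai grounded in this context:\n{context}"
        examples.append(Example(topic, task, context, llm(prompt)))
    return examples


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call (hypothetical)."""
    if prompt.startswith("List"):
        return "Songkran\nMuay Thai\nSongkran\nTom Yum"
    return "INSTRUCTION: " + prompt.splitlines()[-1]


# Toy corpus standing in for Wikipedia articles (hypothetical content).
corpus = {
    "Songkran festival": "Songkran is the Thai New Year.",
    "Muay Thai": "Muay Thai is a combat sport.",
}

dataset = build_dataset(stub_llm, corpus, n_topics=3)
for example in dataset:
    print(example.task, "|", example.topic, "|", example.instruction)
```

Grounding each instruction in retrieved text is what lets a pipeline like this start without seed examples: topical diversity comes from the topic list and cultural context from the retrieved articles. In this sketch, a topic with no matching article is simply skipped.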

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Optimizing Instruction Synthesis: Effective Exploration of Evolutionary Space with Tree Search

CodecLM: Aligning Language Models with Tailored Synthetic Data

Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models

LongForm: Effective Instruction Tuning with Reverse Instructions

Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor

Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Instruction Tuning for Large Language Models: A Survey

EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models

Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration

Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation