Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Yung-Chieh Chan, George Pu, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton

2024-10-30

Abstract: As large language models (LLMs) are applied to more use cases, creating high-quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high-quality human data has been the most common approach to unlock model performance, but is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remains unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories -- Answer Augmentation, Question Rephrase and New Question -- and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not equally effective across settings. Notably, the optimal data generation strategy depends strongly on the ratio between the available teacher query budget and the size of the seed instruction set. When this ratio is low, generating new answers to existing questions proves most effective, but as this ratio increases, generating new questions becomes optimal. Across all tasks, we find that the choice of augmentation method and other design choices matter substantially more in low to mid data regimes than in high data regimes. We provide a practical framework for selecting the appropriate augmentation method across settings, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and the use of different LLMs for synthetic data generation.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper asks how to choose the most effective synthetic data generation strategy for training large language models (LLMs) when resources are constrained. Specifically, it examines how to balance cost against effectiveness so that student LLMs improve under different data constraints, namely the size of the seed instruction set and the teacher query budget. The authors study three representative strategies (Answer Augmentation, Question Rephrase, and New Question) and compare them across task types such as mathematics, programming, and general question answering. The central finding is that the best strategy depends on the ratio of the teacher query budget to the size of the seed instruction set (the Budget Ratio, BR). When this ratio is low, generating new answers to existing questions is most effective; as the ratio increases, generating new questions becomes more effective. The paper also provides a practical framework for selecting an augmentation method, taking into account the scalability of each method, the importance of verifying synthetic data, and the effect of using different LLMs for synthetic data generation.
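The Budget Ratio rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `Strategy` enum, and especially the `threshold` value of 10.0 are hypothetical and are not taken from the paper, which this summary does not quote a cutoff for. The summary also does not say where Question Rephrase falls, so the sketch never returns it.

```python
from enum import Enum


class Strategy(Enum):
    """The three strategy categories named in the abstract."""
    ANSWER_AUGMENTATION = "answer_augmentation"
    QUESTION_REPHRASE = "question_rephrase"  # listed for completeness; not chosen below
    NEW_QUESTION = "new_question"


def budget_ratio(query_budget: int, seed_size: int) -> float:
    """Ratio of the teacher query budget to the seed instruction set size."""
    if seed_size <= 0:
        raise ValueError("seed_size must be positive")
    if query_budget < 0:
        raise ValueError("query_budget must be non-negative")
    return query_budget / seed_size


def suggest_strategy(query_budget: int, seed_size: int,
                     threshold: float = 10.0) -> Strategy:
    """Pick a strategy from the budget ratio.

    ASSUMPTION: `threshold` is a placeholder, not a value from the paper.
    Below it, prefer new answers to existing questions; at or above it,
    prefer generating new questions.
    """
    ratio = budget_ratio(query_budget, seed_size)
    if ratio < threshold:
        return Strategy.ANSWER_AUGMENTATION
    return Strategy.NEW_QUESTION


if __name__ == "__main__":
    # Small budget relative to the seed set: low ratio.
    print(suggest_strategy(query_budget=200, seed_size=100))
    # Large budget relative to the seed set: high ratio.
    print(suggest_strategy(query_budget=5000, seed_size=100))
```

In practice the cutoff would need to be tuned per task and per teacher model; the sketch only captures the direction of the trade-off, not its exact location.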

A Survey on Data Synthesis and Augmentation for Large Language Models

Evaluating Language Models as Synthetic Data Generators

LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models

Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis

Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models

Hybrid Training Approaches for LLMs: Leveraging Real and Synthetic Data to Enhance Model Performance in Domain-Specific Applications

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

Efficacy of Synthetic Data as a Benchmark

Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications

Data Generation Using Large Language Models for Text Classification: An Empirical Case Study

LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?

Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs

Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration

On the Diversity of Synthetic Data and its Impact on Training Large Language Models