Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Yung-Chieh Chan, George Pu, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton
2024-10-30
Abstract: As large language models (LLMs) are applied to more use cases, creating high-quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high-quality human data has been the most common approach to unlock model performance, but is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remains unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories -- Answer Augmentation, Question Rephrase and New Question -- and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not equally effective across settings. Notably, the optimal data generation strategy depends strongly on the ratio between the available teacher query budget and the size of the seed instruction set. When this ratio is low, generating new answers to existing questions proves most effective, but as this ratio increases, generating new questions becomes optimal. Across all tasks, we find that choice of augmentation method and other design choices matter substantially more in low to mid data regimes than in high data regimes. We provide a practical framework for selecting the appropriate augmentation method across settings, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and the use of different LLMs for synthetic data generation.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to choose the most effective synthetic data generation strategy for training large language models (LLMs) under resource constraints. Specifically, it examines how to balance cost against effectiveness so that student LLMs improve as much as possible under different data constraints. The authors study three representative strategies (Answer Augmentation, Question Rephrase, and New Question) and evaluate them across several task types, such as mathematics, programming, and general question answering. The central finding is that the best strategy depends on the budget ratio (BR): the ratio of the teacher query budget to the size of the seed instruction set. When this ratio is low, generating new answers to existing questions is most effective; as the ratio increases, generating new questions becomes more effective. The paper also provides a practical framework for selecting an appropriate augmentation method, taking into account the scalability of each method, the importance of verifying synthetic data, and the effect of using different LLMs for synthetic data generation.
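
The dependence on the budget ratio can be illustrated with a minimal sketch. It is not the authors' framework: the function names, the strategy labels, and in particular the numeric cutoff are hypothetical, since the text above gives no threshold at which the preferred strategy switches. The sketch only encodes the qualitative rule (low ratio favors new answers, high ratio favors new questions) and leaves Question Rephrase out, because the summary does not say where it fits.

```python
# Hypothetical cutoff for illustration only; the source states no numeric value.
ILLUSTRATIVE_THRESHOLD = 10.0


def budget_ratio(query_budget: int, seed_size: int) -> float:
    """Ratio of the teacher query budget to the seed instruction set size."""
    if seed_size <= 0:
        raise ValueError("seed_size must be positive")
    if query_budget < 0:
        raise ValueError("query_budget must be non-negative")
    return query_budget / seed_size


def suggest_strategy(
    query_budget: int,
    seed_size: int,
    threshold: float = ILLUSTRATIVE_THRESHOLD,
) -> str:
    """Return a strategy label following the qualitative rule in the summary.

    Below the (assumed) threshold, prefer new answers to existing questions;
    at or above it, prefer generating new questions.
    """
    ratio = budget_ratio(query_budget, seed_size)
    if ratio < threshold:
        return "answer_augmentation"
    return "new_question"


if __name__ == "__main__":
    # 1,000 teacher queries over 500 seed instructions: ratio of 2.0.
    print(budget_ratio(1_000, 500), suggest_strategy(1_000, 500))
    # 100,000 teacher queries over 500 seed instructions: ratio of 200.0.
    print(budget_ratio(100_000, 500), suggest_strategy(100_000, 500))
```

In practice the cutoff would have to be chosen per task and per teacher model, for example by consulting the paper's experiments, rather than fixed in advance as it is here.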