On the Diversity of Synthetic Data and its Impact on Training Large Language Models

Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin
2024-10-23
Abstract: The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has focused predominantly on the quality and quantity of real data, our work enables the measurement of diversity in synthetic data and explores its impact on LLM performance. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages by introducing a new diversity metric, LLM cluster-agent, designed to evaluate the diversity of synthetic datasets. Through a series of controlled experiments with models of 350M and 1.4B parameters, we demonstrate that the proposed cluster-based LLM scoring of diversity correlates positively with both pre-training and supervised fine-tuning performance. Our findings also reveal that synthetic data diversity in pre-training affects supervised fine-tuning more significantly than pre-training itself, even for smaller models. We hope this study advances our understanding of the optimal use of synthetic data in LLM training and opens new avenues for efficient data generation processes.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper explores how the diversity of synthetic data affects the performance of large language models (LLMs). Specifically, it focuses on the following aspects:

1. **Measurement of Synthetic Data Diversity**: Existing research mainly focuses on the quality and quantity of real data, and few methods exist for measuring the diversity of synthetic data. The paper proposes a new metric, LLM Cluster-agent, to evaluate the diversity of synthetic data.
2. **Impact of Synthetic Data Diversity on Pre-training and Fine-tuning**: Through a series of controlled experiments, the study investigates how synthetic data diversity affects LLM performance during the pre-training and supervised fine-tuning stages. The experiments cover models of two scales (350M and 1.4B parameters) and synthetic datasets of varying diversity.
3. **Optimal Synthetic Data Generation Strategy**: The paper explores how to generate more diverse synthetic data to improve LLM performance. This includes studying the underlying distribution of the synthetic data, the generation prompts and models, and the ratio of synthetic to real data.

### Main Findings

1. **Positive Correlation Between LLM Cluster Score and Performance**: The LLM Cluster Score is positively correlated with performance in both pre-training and supervised fine-tuning, which suggests that the metric can serve as a predictor of downstream LLM performance.
2. **Impact of Underlying Distribution of Synthetic Data on Performance**: More unique topics generally provide better diversity, but too many generations per topic may introduce redundancy and thereby harm performance.
3. **Different Generation Prompts and Models Improve Diversity**: Combining generation prompts with different text styles and target audiences can significantly enhance the diversity of synthetic data and the performance of LLMs.
4. **Better Generation Models Produce More Diverse Data**: Synthetic data generated by more advanced models (such as GPT-4) is more diverse than data generated by weaker models (such as GPT-3.5), which improves the performance of the models trained on it.
5. **Balance Between Real and Synthetic Data**: A more balanced ratio of real to synthetic data is most beneficial for LLM performance. Over-reliance on synthetic data may reduce diversity and thereby harm performance.
6. **Impact of Diversity on Small and Large Models**: The pre-training performance of small models saturates faster as synthetic data diversity increases, but greater diversity still significantly improves their supervised fine-tuning performance.

### Conclusion

The LLM Cluster-agent metric proposed in the paper shows potential for practical, large-scale pre-training of LLMs on synthetic data. The findings offer insights into efficient and diverse synthetic data generation, which can help improve LLM performance.
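
The positive correlation described in the first finding is the kind of relationship a correlation coefficient can quantify. The following is a minimal sketch, not the paper's evaluation code: it computes a Pearson coefficient between per-dataset diversity scores and a benchmark metric. The function name `pearson`, the variable names, and all numbers are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all left unnormalised (the factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only; NOT results from the paper.
# One entry per synthetic dataset: its diversity score and the benchmark
# accuracy of a model trained on it.
diversity_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
benchmark_accuracy = [0.40, 0.43, 0.45, 0.46, 0.50]

r = pearson(diversity_scores, benchmark_accuracy)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 would be consistent with a diversity score that tracks downstream performance. With only a handful of datasets the estimate is noisy, so a rank-based coefficient or a significance test may be a sensible complement.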