Abstract: The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has focused predominantly on the quality and quantity of real data, our work enables the measurement of diversity in synthetic data and explores its impact on LLM performance. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages by introducing a new diversity metric, *LLM cluster-agent*, designed to evaluate the diversity of synthetic datasets. Through a series of controlled experiments with models of 350M and 1.4B parameters, we demonstrate that the proposed cluster-based LLM scoring of diversity correlates positively with both pre-training and supervised fine-tuning performance. Our findings also reveal that synthetic data diversity in pre-training affects supervised fine-tuning more significantly than pre-training itself, even for smaller models. We hope this study advances our understanding of the optimal use of synthetic data in LLM training and opens new avenues for efficient data generation processes.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper explores how the diversity of synthetic data affects the performance of large language models (LLMs). It focuses on three aspects:

1. **Measurement of Synthetic Data Diversity**: Existing research mainly addresses the quality and quantity of real data, and few methods exist for measuring the diversity of synthetic data. The paper proposes a new metric, LLM Cluster-agent, to evaluate the diversity of synthetic datasets.
2. **Impact of Synthetic Data Diversity on Pre-training and Fine-tuning**: Through a series of controlled experiments, the study investigates how synthetic data diversity affects LLM performance during the pre-training and supervised fine-tuning stages. The experiments cover models at two scales (350M and 1.4B parameters) and synthetic datasets of varying diversity.
3. **Optimal Synthetic Data Generation Strategy**: The paper explores how to generate more diverse synthetic data to improve LLM performance. This includes studying the underlying distribution of the synthetic data, the generation prompts and models, and the ratio of synthetic to real data.

### Main Findings

1. **Positive Correlation Between LLM Cluster Score and Performance**: The LLM Cluster score correlates positively with both pre-training and supervised fine-tuning performance, which suggests the metric can serve as a predictor of downstream performance.
2. **Impact of Underlying Distribution of Synthetic Data on Performance**: More unique topics generally yield better diversity, but generating too many samples per topic may introduce redundancy and thereby harm performance.
3. **Different Generation Prompts and Models Improve Diversity**: Combining generation prompts that vary in text style and target audience can significantly increase the diversity of synthetic data and the performance of the trained LLMs.
4. **Better Generation Models Produce More Diverse Data**: Synthetic data generated by more advanced models (such as GPT-4) is more diverse than data generated by less capable models (such as GPT-3.5), which improves the performance of the models trained on it.
5. **Balance Between Real and Synthetic Data**: A more balanced ratio of real to synthetic data is most beneficial for LLM performance. Over-reliance on synthetic data may reduce diversity and thereby harm performance.
6. **Impact of Diversity on Small and Large Models**: Although the pre-training performance of smaller models saturates faster as synthetic data diversity increases, greater diversity still significantly improves supervised fine-tuning performance.

### Conclusion

The proposed LLM Cluster-agent metric shows potential for practical, large-scale LLM pre-training on synthetic data. The findings offer insights into efficient and diverse synthetic data generation, which can help improve LLM performance.
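The summary above does not spell out how a cluster-based diversity score is computed, so the following is only a rough sketch under stated assumptions. It assumes a score of the form "average number of clusters per sample over several random subsets of the corpus". In the paper an LLM performs the grouping; here a hypothetical stand-in, `toy_cluster`, simply counts distinct first words. The names `cluster_diversity_score`, `toy_cluster`, `k`, and `n_rounds` are all illustrative and are not taken from the paper.

```python
import random
from typing import Callable, Sequence


def toy_cluster(texts: Sequence[str]) -> int:
    """Hypothetical stand-in for the LLM clustering step.

    Returns the number of distinct first words, treated here as the
    number of clusters. A real pipeline would ask an LLM to group texts.
    """
    return len({t.split()[0].lower() for t in texts if t.split()})


def cluster_diversity_score(
    corpus: Sequence[str],
    cluster_fn: Callable[[Sequence[str]], int],
    k: int = 4,
    n_rounds: int = 10,
    seed: int = 0,
) -> float:
    """Average clusters-per-sample over n_rounds random subsets of size k.

    This is an assumed form of a cluster-based diversity score, not a
    confirmed reproduction of the paper's metric.
    """
    if not corpus:
        raise ValueError("corpus must not be empty")
    rng = random.Random(seed)
    k = min(k, len(corpus))  # cannot sample more items than exist
    ratios = []
    for _ in range(n_rounds):
        subset = rng.sample(list(corpus), k)
        ratios.append(cluster_fn(subset) / len(subset))
    return sum(ratios) / len(ratios)


# Toy corpora: one repetitive, one varied (illustrative strings only).
repetitive = ["math lesson one", "math lesson two", "math quiz",
              "math notes", "math recap"]
varied = ["math lesson", "history essay", "cooking guide",
          "physics lab", "poetry reading"]

print(cluster_diversity_score(repetitive, toy_cluster))
print(cluster_diversity_score(varied, toy_cluster))
```

Under this toy clustering rule the repetitive corpus scores lower than the varied one, which is the qualitative behavior a diversity metric should show. The absolute values carry no meaning beyond illustrating the arithmetic.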

On the Diversity of Synthetic Data and its Impact on Training Large Language Models
