Abstract: Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of the synthetic data each one generates, in terms of data quality, diversity, and complexity (QDC). We choose these three characteristics for their significance in open-ended processes and the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of quality-diversity (QD) trade-offs in training data and their downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on the QDC composition of the data. This analysis extends into a discussion of the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, there often exist trade-offs between model output quality and output diversity, which impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms and highlight a number of works making progress in this direction.
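To make the idea of profiling a dataset by quality, diversity, and complexity more concrete, the sketch below computes one crude proxy for each characteristic over a list of text samples. The abstract does not specify any metrics, so everything here is an illustrative assumption: `distinct_n`, `qdc_profile`, and the `quality_fn` scorer are hypothetical names, distinct n-gram ratio stands in for diversity, mean whitespace-token count stands in for complexity, and quality is whatever score the caller supplies.

```python
from typing import Callable, Dict, Sequence


def distinct_n(samples: Sequence[str], n: int = 2) -> float:
    """Diversity proxy (assumed): share of n-grams in the dataset that are unique."""
    ngrams = []
    for text in samples:
        tokens = text.split()
        # Collect every contiguous n-gram of whitespace-separated tokens.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def qdc_profile(
    samples: Sequence[str],
    quality_fn: Callable[[str], float],
    n: int = 2,
) -> Dict[str, float]:
    """Summarize a dataset with three illustrative proxies.

    quality    -> mean of a caller-supplied score (hypothetical scorer)
    diversity  -> distinct-n ratio over the whole dataset
    complexity -> mean number of whitespace tokens per sample
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    quality = sum(quality_fn(s) for s in samples) / len(samples)
    complexity = sum(len(s.split()) for s in samples) / len(samples)
    return {
        "quality": quality,
        "diversity": distinct_n(samples, n),
        "complexity": complexity,
    }


if __name__ == "__main__":
    repetitive = ["a b c", "a b c"]
    varied = ["a b c", "d e f"]
    # A constant scorer keeps quality fixed so only diversity differs.
    print(qdc_profile(repetitive, lambda s: 1.0))
    print(qdc_profile(varied, lambda s: 1.0))
```

Comparing the two profiles shows how a dataset can hold quality constant while differing in diversity, which is the kind of trade-off the abstract highlights. Real evaluations would likely replace each proxy with a task-appropriate measure.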

Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

Give me Some Hard Questions: Synthetic Data Generation for Clinical QA

FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

Training Question Answering Models From Synthetic Data

SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task

TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data

SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation

SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation

From Words to Code: Harnessing Data for Program Synthesis from Natural Language

Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

SynthCypher: A Fully Synthetic Data Generation Framework for Text-to-Cypher Querying in Knowledge Graphs

Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models

Better Synthetic Data by Retrieving and Transforming Existing Datasets