Abstract: Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore, this paper provides an organization of relevant studies based on a generic workflow of synthetic data generation. By doing so, we highlight the gaps within existing research and outline prospective avenues for future study. This work aims to shepherd the academic and industrial communities towards deeper, more methodical inquiries into the capabilities and applications of LLMs-driven synthetic data generation.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the long-standing issues of data quantity and data quality in deep learning. Although large language models (LLMs) have opened up new ways to generate synthetic data, current research in this area lacks a unified framework and mostly remains superficial. The paper therefore organizes related research around a generic workflow of synthetic data generation, highlights gaps in existing work, and proposes directions for future research. Its main goal is to guide academia and industry toward a more in-depth and systematic exploration of the capabilities and applications of LLMs-driven synthetic data generation.

### Specific Problems

1. **Data Volume and Data Quality Issues**:
   - High-quality data is crucial for building robust natural language processing (NLP) models.
   - Human-generated data struggles to meet this demand because of high costs, data scarcity, and privacy issues.
   - Human-generated data may also contain biases and errors, which are detrimental to model training and evaluation.
2. **Effectiveness and Scalability of Synthetic Data Generation**:
   - LLMs can generate fluent text comparable to human output, providing an effective method for synthetic data generation.
   - Synthetic data can supplement or replace human-generated data, addressing issues of data volume and data quality.
3. **Shortcomings of Existing Research**:
   - There is no unified framework to guide research on synthetic data generation.
   - Existing research mostly focuses on data generation for specific tasks and domains, and lacks comprehensive reviews and methodologies.

### Solutions

1. **Establishing a Unified Workflow**:
   - The paper organizes related research around three main aspects (generation, curation, and evaluation), providing a systematic workflow.
2. **Identifying Key Research Areas**:
   - By reviewing existing research, the paper identifies key issues and unresolved gaps in current studies.
3. **Promoting Further Development**:
   - The paper aims to provide valuable insights for academia and industry, promoting further development in the field of LLMs-driven synthetic data generation.

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
