On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Lin Long,Rui Wang,Ruixuan Xiao,Junbo Zhao,Xiao Ding,Gang Chen,Haobo Wang
2024-06-14
Abstract: Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore, this paper provides an organization of relevant studies based on a generic workflow of synthetic data generation. By doing so, we highlight the gaps within existing research and outline prospective avenues for future study. This work aims to shepherd the academic and industrial communities towards deeper, more methodical inquiries into the capabilities and applications of LLMs-driven synthetic data generation.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the long-standing problems of data quantity and data quality in deep learning. Large language models (LLMs) offer a new way to generate synthetic data, but current research in this area lacks a unified framework and mostly remains superficial. The paper therefore organizes related research around a generic workflow of synthetic data generation, highlights gaps in existing work, and proposes directions for future research. Its main goal is to guide academia and industry toward a deeper, more systematic exploration of the capabilities and applications of LLMs-driven synthetic data generation.

### Specific Problems

1. **Data Volume and Data Quality Issues**:
   - High-quality data is crucial for building robust natural language processing (NLP) models.
   - Human-generated data struggles to meet this demand because of high costs, data scarcity, and privacy concerns.
   - Human-generated data may also contain biases and errors, which are detrimental to model training and evaluation.
2. **Effectiveness and Scalability of Synthetic Data Generation**:
   - LLMs can generate fluent text comparable to human output, providing an effective method for synthetic data generation.
   - Synthetic data can supplement or replace human-generated data, addressing issues of data volume and data quality.
3. **Shortcomings of Existing Research**:
   - There is no unified framework to guide research on synthetic data generation.
   - Existing research mostly focuses on data generation for specific tasks and domains, and lacks comprehensive reviews and methodologies.

### Solutions

1. **Establishing a Unified Workflow**:
   - The paper organizes related research around three main aspects (generation, curation, and evaluation), providing a systematic workflow.
2. **Identifying Key Research Areas**:
   - By reviewing existing research, the paper identifies key issues and unresolved gaps in current studies.
3. **Promoting Further Development**:
   - The paper aims to provide valuable insights for academia and industry, promoting further development in the field of LLMs-driven synthetic data generation.
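
The three-stage workflow named under Solutions (generation, curation, evaluation) can be pictured with a toy pipeline. This is only an illustrative sketch, not the survey's method: `stub_llm` is a hypothetical stand-in for a real model call, the curation rule (drop short samples and exact duplicates) and the `distinct_1` diversity score are assumptions chosen for brevity, and all function names are invented for this example.

```python
from typing import Callable, List


def generate(prompts: List[str], llm: Callable[[str], str]) -> List[str]:
    """Generation: request one sample per prompt from an LLM-like callable."""
    return [llm(p) for p in prompts]


def curate(samples: List[str], min_words: int = 3) -> List[str]:
    """Curation: drop very short samples and exact duplicates, keeping order."""
    seen = set()
    kept = []
    for s in samples:
        key = s.strip().lower()
        if len(key.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(s.strip())
    return kept


def distinct_1(samples: List[str]) -> float:
    """Evaluation: unique words / total words, a crude diversity proxy."""
    words = [w for s in samples for w in s.lower().split()]
    return len(set(words)) / len(words) if words else 0.0


def stub_llm(prompt: str) -> str:
    """Hypothetical placeholder for a model call; returns a templated sentence."""
    return f"A synthetic review about {prompt}."


if __name__ == "__main__":
    raw = generate(["phones", "laptops", "phones"], stub_llm)
    clean = curate(raw)
    print(len(raw), len(clean), distinct_1(clean))
```

In practice each stage would be far richer (prompt design for generation, quality filtering or label correction for curation, downstream-task benchmarks for evaluation); the sketch only shows how the stages hand data to one another.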