A Survey on Data Synthesis and Augmentation for Large Language Models

Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang

2024-10-17

Abstract: The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.

Computation and Language

What problem does this paper attempt to address?

This paper addresses the scarcity of high-quality data in the development of large language models (LLMs). As models grow in scale, so does the demand for large-scale, diverse, high-quality training data. However, the growth rate of high-quality data lags far behind the expansion rate of training datasets, which may lead to a data-depletion crisis. The paper therefore stresses the urgency of improving data efficiency and exploring new data sources, synthetic data in particular. It reviews and summarizes data-generation techniques across the LLM lifecycle: data preparation, pre-training, fine-tuning, instruction tuning, preference alignment, and applications. It also discusses the current limitations of these methods and explores potential paths for future development. The aim is to give researchers a clear understanding of the methods so that they can quickly identify suitable data-generation strategies when building LLMs, and to offer insights for future exploration.

Large Language Models for Data Annotation and Synthesis: A Survey

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

A Survey on Data Augmentation in Large Model Era

Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges

Data Generation Using Large Language Models for Text Classification: An Empirical Case Study

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Generative AI for Synthetic Data Generation: Methods, Challenges and the Future

Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

Evaluating Language Models as Synthetic Data Generators

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

Large Language Models for Data Annotation: A Survey

Mastering the Craft of Data Synthesis for CodeLLMs

A Survey of Large Language Models

Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration

Large language models and synthetic health data: progress and prospects

Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems