Abstract: In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of LLMs on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond. From both data and learning perspectives, we examine various strategies that utilize LLMs for data augmentation, including a novel exploration of learning paradigms where LLM-generated data is used for diverse forms of further training. Additionally, this paper highlights the primary open challenges faced in this domain, ranging from controllable data augmentation to multi-modal data augmentation. This survey highlights a paradigm shift introduced by LLMs in DA, and aims to serve as a comprehensive guide for researchers and practitioners.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper explores the application of large language models (LLMs) to data augmentation (DA) and the changes they bring to the field. Specifically, it addresses the following key problems:

1. **The Difficulty of Obtaining High-Quality Data**
   - **Problem Background**: High-quality data is crucial for training high-performance artificial intelligence models, but obtaining it is often costly and time-consuming. Moreover, data labeling usually requires substantial human effort and is prone to labeling errors.
   - **Solution**: Using LLMs for data augmentation, synthetic data can be generated without additional data collection, with the aim of improving model performance.

2. **Innovation in Data Augmentation Methods**
   - **Problem Background**: Traditional data augmentation methods mainly apply simple transformations to existing data, such as synonym replacement or sentence reordering in text, and these methods offer limited gains in model performance.
   - **Solution**: The paper surveys LLM-based approaches to data augmentation, covering strategies such as data creation, data labeling, data reformation, and co-annotation. These approaches can generate more diverse, higher-quality synthetic data.

3. **Expansion of Learning Paradigms**
   - **Problem Background**: Traditional machine learning tasks, such as machine translation, sentiment analysis, and named entity recognition, have relatively mature methods, but emerging learning paradigms such as instruction tuning, in-context learning, and alignment learning remain largely unexplored.
   - **Solution**: The paper explores how to train models on LLM-generated synthetic data across a wider range of learning paradigms, for example generating pseudo-data for classification tasks and scoring data for regression tasks.

4. **Challenges and Future Directions**
   - **Problem Background**: Although LLMs show great potential in data augmentation, they still face challenges such as data contamination, controllability of the generated data, cross-cultural data augmentation, multi-modal data augmentation, and privacy protection.
   - **Solution**: The paper discusses these challenges in detail and proposes future research directions for the field.

### Summary

By systematically reviewing the application of LLMs to data augmentation, this paper addresses the difficulty of obtaining high-quality data, organizes new LLM-based data augmentation methods, and extends the discussion to a broader set of learning paradigms. It also identifies the main open challenges in current research and offers guidance for future work.
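To make the contrast drawn in problem 2 concrete, here is a minimal sketch of the traditional synonym-replacement augmentation mentioned above. It is not taken from the paper: the toy `SYNONYMS` table, the function names `synonym_replace` and `augment`, and the replacement probability `p` are all assumptions chosen for illustration.

```python
import random

# Hypothetical toy synonym table (an assumption for this sketch);
# a real pipeline would draw on a thesaurus or lexical database.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "quick": ["fast", "rapid"],
}


def synonym_replace(sentence, synonyms, rng, p=0.5):
    """Swap each word that has a known synonym with probability p."""
    out = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def augment(sentence, label, n, synonyms=SYNONYMS, seed=0):
    """Create n variants of one labeled example, keeping the label unchanged."""
    rng = random.Random(seed)
    return [(synonym_replace(sentence, synonyms, rng), label) for _ in range(n)]


if __name__ == "__main__":
    for text, label in augment("a good movie with a quick plot", "positive", 3):
        print(label, "|", text)
```

Each variant keeps the original label and sentence structure, which illustrates why such transformations add little diversity. The LLM-based strategies the survey covers would, roughly speaking, replace the fixed lookup table with a call to a language model that rewrites or creates examples; that call is omitted here because it depends on the model and API in use.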

Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges

A Survey on Data Augmentation in Large Model Era

A Survey on Data Synthesis and Augmentation for Large Language Models

Data Augmentation Approaches in Natural Language Processing: A Survey

Large Language Models for Data Annotation: A Survey

LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition

Large Language Models for Data Annotation and Synthesis: A Survey

Large Language Models for Education: A Survey and Outlook

Improving Text Classification with Large Language Model-Based Data Augmentation

Empowering Large Language Models for Textual Data Augmentation

An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

Applications and Challenges for Large Language Models: from Data Management Perspective

Enhancing Intent Classifier Training with Large Language Model-generated Data

Controllable and Diverse Data Augmentation with Large Language Model for Low-Resource Open-Domain Dialogue Generation

Data Augmentation for Text-based Person Retrieval Using Large Language Models

A Survey of Data Augmentation Approaches for NLP

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Learnings from Data Integration for Augmented Language Models

Security and Privacy Challenges of Large Language Models: A Survey