Abstract: Large models, encompassing large language and diffusion models, have shown exceptional promise in approximating human-level intelligence, garnering significant interest from both academic and industrial spheres. However, the training of these large models necessitates vast quantities of high-quality data, and with continuous updates to these models, the existing reservoir of high-quality data may soon be depleted. This challenge has catalyzed a surge in research focused on data augmentation methods. Leveraging large models, these data augmentation techniques have outperformed traditional approaches. This paper offers an exhaustive review of large model-driven data augmentation methods, adopting a comprehensive perspective. We begin by establishing a classification of relevant studies into three main categories: image augmentation, text augmentation, and paired data augmentation. Following this, we delve into various data post-processing techniques pertinent to large model-based data augmentation. Our discussion then expands to encompass the array of applications for these data augmentation methods within natural language processing, computer vision, and audio signal processing. We proceed to evaluate the successes and limitations of large model-based data augmentation across different scenarios. Concluding our review, we highlight prospective challenges and avenues for future exploration in the field of data augmentation. Our objective is to furnish researchers with critical insights, ultimately contributing to the advancement of more sophisticated large models. We consistently maintain the related open-source materials at:

What problem does this paper attempt to address?

The paper surveys data augmentation methods driven by large models, focusing on large language models (LLMs) and diffusion models. Its core objective is to examine how these models can generate more diverse, high-quality data to support the training of more complex large models. Specifically, the paper covers the following:

1. **Background Introduction**: It introduces the basic concepts of data augmentation and its importance, and gives an overview of the development of large language models and diffusion models and their potential applications in data augmentation.
2. **Review Content**:
   - **Method Classification**: The related research is divided into three main categories: image augmentation, text augmentation, and paired data augmentation, with a detailed discussion of the characteristics and application scenarios of each.
   - **Data Post-Processing Techniques**: It explores strategies for improving the quality of augmented data, including Top-K selection, model-based methods, score-based methods, and clustering-based methods.
   - **Application Cases**: It analyzes how these data augmentation methods perform in practice in natural language processing, computer vision, and audio signal processing.
3. **Challenges and Future Directions**: The paper discusses current challenges, such as insufficient theoretical understanding, limits on the quantity of augmented data, and the difficulty of multimodal data augmentation, and proposes future research directions.

Through this review, the authors aim to give researchers a comprehensive framework for understanding the field and to promote further exploration and development in data augmentation.
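Of the post-processing strategies listed above, Top-K selection is the simplest to illustrate. The sketch below is not taken from the paper: `top_k_filter` and `toy_score` are hypothetical names, and the scoring rule (fraction of unique words) is only a stand-in for whatever quality score a real pipeline would use, such as a classifier confidence or a similarity measure.

```python
from typing import Callable, List, Sequence, Tuple


def top_k_filter(
    samples: Sequence[str],
    score_fn: Callable[[str], float],
    k: int,
) -> List[Tuple[str, float]]:
    """Keep the k highest-scoring augmented samples (hypothetical helper)."""
    if k <= 0:
        return []
    scored = [(sample, score_fn(sample)) for sample in samples]
    # list.sort is stable, so samples with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def toy_score(text: str) -> float:
    """Placeholder quality score (an assumption): fraction of unique words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


# Pretend these strings were produced by a large model as augmented data.
candidates = [
    "the cat sat on the mat",
    "good good good good",
    "a quick brown fox jumps",
    "",
]

kept = top_k_filter(candidates, toy_score, k=2)
for sample, score in kept:
    print(f"{score:.2f}  {sample}")
```

With this toy score, repetitive or empty generations rank lowest and are dropped first. Replacing `toy_score` with a model-based scorer would turn the same filter into a simple instance of the score-based or model-based strategies the summary mentions.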

A Survey on Data Augmentation in Large Model Era

Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges

A Comprehensive Survey on Data Augmentation

A Survey on Data Synthesis and Augmentation for Large Language Models

Image Data Augmentation for Deep Learning: A Survey

A Survey of Data Augmentation Approaches for NLP

A Brief Survey on Semantic-preserving Data Augmentation

An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

Large Model-Based Data Augmentation for Imbalanced Text Classification

A survey on Image Data Augmentation for Deep Learning

Data Augmentation Approaches in Natural Language Processing: A Survey

Image data augmentation techniques based on deep learning: A survey

Survey on Sequence Data Augmentation

Improving Text Classification with Large Language Model-Based Data Augmentation

Empowering Large Language Models for Textual Data Augmentation

Source Code Data Augmentation for Deep Learning: A Survey

Augmentation Policy Generation for Image Classification Using Large Language Models

Exploring Data Augmentation Methods on Social Media Corpora

Data Augmentation in Human-Centric Vision

Time Series Data Augmentation for Deep Learning: A Survey