Abstract: Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, the efficiency of synthetic data generated by current methodologies remains inferior when training advanced deep models exclusively, limiting its practical utility. To address this challenge, we analyze the principles underlying training data synthesis for supervised learning and elucidate a principled theoretical framework from the distribution-matching perspective that explicates the mechanisms governing synthesis efficacy. Through extensive experiments, we demonstrate the effectiveness of our synthetic data across diverse image classification tasks, both as a replacement for and augmentation to real datasets, while also offering benefits such as out-of-distribution generalization, privacy preservation, and scalability. Specifically, we achieve 70.9% top-1 classification accuracy on ImageNet-1K when training solely with synthetic data equivalent to 1× the original real data size, which increases to 76.0% when scaling up to 10× synthetic data.

What problem does this paper attempt to address?

This paper addresses the following problem: synthetic data is currently less effective than real data for training advanced deep-learning models, which limits its practical applications. Although synthetic data offers advantages in dataset augmentation, generalization evaluation, and privacy protection, the data generated by existing methods differs substantially in distribution from real data, resulting in poor model performance. To address this, the authors propose a theoretical framework based on distribution matching that aims to explain and optimize the effectiveness of synthetic data. Through extensive experiments, they demonstrate the effectiveness of their synthetic data across various image classification tasks, both as a replacement for and as an augmentation to real datasets, while also showing better generalization ability and privacy-protection characteristics.

### Main contributions of the paper:

1. **Introduced a theoretical framework based on distribution matching**, emphasizing two key factors: (1) the distribution difference between the target and synthetic data; (2) the size of the training set.
2. **Utilized state-of-the-art text-to-image diffusion models (such as Stable Diffusion)**, achieving better distribution alignment through comprehensive analysis and improvement of training objectives, conditional generation, and prior initialization.
3. **Conducted empirical research on multiple benchmark datasets**, demonstrating the superior performance of synthetic data in image classification tasks and showing its advantages in generalization ability and privacy protection.

### Specific problem descriptions:

- **Distribution difference**: Existing synthetic data generation methods fail to match the distribution of real data well, resulting in a decline in model performance.
- **Training set size**: The amount of synthetic data is usually limited and difficult to compare with large-scale real datasets.
- **Limitations of existing methods**: Previous solutions (such as prompt engineering and expensive reverse methods) are neither sufficient nor efficient and lack theoretical support.

By solving these problems, the authors not only improve the quality and practicality of synthetic data but also provide theoretical guidance and technical references for future research.
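To make the "distribution difference" factor concrete, here is a minimal sketch of one common way to quantify how far a synthetic set is from a real one: the squared maximum mean discrepancy (MMD) with an RBF kernel. This is an illustrative assumption, not necessarily the measure or implementation used in the paper; the feature arrays are random stand-ins rather than real image features, and the names `rbf_kernel`, `mmd2`, and the `gamma` value are hypothetical choices.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    # Squared Euclidean distance between every row of a and every row of b.
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y (rows are samples)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
# Random stand-ins for feature vectors (assumption: 8-dimensional features).
real = rng.normal(0.0, 1.0, size=(200, 8))    # "real" features
close = rng.normal(0.1, 1.0, size=(200, 8))   # synthetic set with a small shift
far = rng.normal(1.5, 1.0, size=(200, 8))     # synthetic set with a large shift

mmd_close = mmd2(real, close, gamma=0.1)
mmd_far = mmd2(real, far, gamma=0.1)
print(f"MMD^2 real vs close: {mmd_close:.4f}")
print(f"MMD^2 real vs far:   {mmd_far:.4f}")
```

Under these assumptions, the synthetic set with the larger shift yields the larger discrepancy, which is the quantity a distribution-matching approach would try to reduce. In practice the features would come from a pretrained encoder, and the kernel bandwidth would need tuning.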

Real-Fake: Effective Training Data Synthesis Through Distribution Matching

Image‐level Dataset Synthesis with an End‐to‐end Trainable Framework

A Study on Improving Realism of Synthetic Data for Machine Learning

Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization

Analyzing Effects of Fake Training Data on the Performance of Deep Learning Systems

Synthetic Data for Model Selection

Up to 100x Faster Data-Free Knowledge Distillation

Synthetic Image Data for Deep Learning

Beyond Photo Realism for Domain Adaptation from Synthetic Data

Exploring the Impact of Synthetic Data for Aerial-view Human Detection

SynFace: Face Recognition with Synthetic Data

Exploring the Potential of Synthetic Data to Replace Real Data

From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

Improving the Effectiveness of Deep Generative Data

Synthetic Examples Improve Generalization for Rare Classes

Best Practices and Lessons Learned on Synthetic Data

The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better

Synthetic Data for Object Classification in Industrial Applications

Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images

Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?

If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces