Real-Fake: Effective Training Data Synthesis Through Distribution Matching

Jianhao Yuan, Jie Zhang, Shuyang Sun, Philip Torr, Bo Zhao
2024-03-20
Abstract: Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, the efficiency of synthetic data generated by current methodologies remains inferior when training advanced deep models exclusively, limiting its practical utility. To address this challenge, we analyze the principles underlying training data synthesis for supervised learning and elucidate a principled theoretical framework from the distribution-matching perspective that explicates the mechanisms governing synthesis efficacy. Through extensive experiments, we demonstrate the effectiveness of our synthetic data across diverse image classification tasks, both as a replacement for and augmentation to real datasets, while also delivering benefits such as out-of-distribution generalization, privacy preservation, and scalability. Specifically, we achieve 70.9% top-1 classification accuracy on ImageNet1K when training solely with synthetic data equivalent to 1× the original real data size, which increases to 76.0% when scaling up to 10× synthetic data.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is that synthetic data is currently less effective than real data for training advanced deep learning models, which limits its practical use. Although synthetic data offers advantages in dataset augmentation, generalization evaluation, and privacy protection, the data generated by existing methods differs substantially in distribution from real data, which leads to poor model performance. To solve this, the authors propose a theoretical framework based on distribution matching that aims to explain and optimize the effectiveness of synthetic data. Through extensive experiments, they demonstrate the effectiveness of their synthetic data across image classification tasks, both as a replacement for and as an augmentation to real datasets, along with improved out-of-distribution generalization and privacy protection.

### Main contributions of the paper:

1. **Introduced a theoretical framework based on distribution matching**, emphasizing two key factors: (1) the distribution difference between the target and synthetic data; (2) the size of the training set.
2. **Utilized state-of-the-art text-to-image diffusion models (such as Stable Diffusion)**, achieving better distribution alignment through comprehensive analysis and improvement of the training objective, conditional generation, and prior initialization.
3. **Conducted empirical research on multiple benchmark datasets**, demonstrating the strong performance of synthetic data in image classification tasks and its advantages in generalization and privacy protection.

### Specific problem descriptions:

- **Distribution difference**: Existing synthetic data generation methods fail to match the distribution of real data well, resulting in a decline in model performance.
- **Training set size**: The amount of synthetic data is usually limited and does not reach the scale of large real datasets.
- **Limitations of existing methods**: Previous solutions (such as prompt engineering and expensive inversion-based methods) are neither sufficient nor efficient, and they lack theoretical support.

By addressing these problems, the authors not only improve the quality and practicality of synthetic data but also provide theoretical guidance and technical references for future research.
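
The distribution difference between real and synthetic data can be made concrete with a small numerical sketch. The example below uses maximum mean discrepancy (MMD) with a Gaussian kernel, which is one common way to measure such a gap. It is an illustrative assumption here, since this summary does not state which discrepancy measure the paper uses. The function names, the toy Gaussian "features", and the bandwidth value are all hypothetical.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(real, synth, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(real, real)
            + mean_kernel(synth, synth)
            - 2.0 * mean_kernel(real, synth))


def sample_gaussian(n, dim, mean, rng):
    """Toy stand-in for image features: n vectors drawn from N(mean, 1)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = sample_gaussian(100, 4, 0.0, rng)         # hypothetical "real" features
close_synth = sample_gaussian(100, 4, 0.1, rng)  # well-aligned synthetic set
far_synth = sample_gaussian(100, 4, 2.0, rng)    # poorly aligned synthetic set

mmd_close = mmd_squared(real, close_synth, bandwidth=2.0)
mmd_far = mmd_squared(real, far_synth, bandwidth=2.0)
print(f"close: {mmd_close:.4f}, far: {mmd_far:.4f}")
```

In this toy setting the poorly aligned synthetic set yields a larger MMD than the well-aligned one. In practice the vectors would come from a feature extractor applied to real and generated images rather than from Gaussian samples.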