Distribution-Aware Data Expansion with Diffusion Models

Haowei Zhu, Ling Yang, Jun-Hai Yong, Hongzhi Yin, Jiawei Jiang, Meng Xiao, Wentao Zhang, Bin Wang
2024-06-05
Abstract: The scale and quality of a dataset significantly impact the performance of deep models. However, acquiring large-scale annotated datasets is both a costly and time-consuming endeavor. To address this challenge, dataset expansion technologies aim to automatically augment datasets, unlocking the full potential of deep models. Current data expansion techniques include image transformation and image synthesis methods. Transformation-based methods introduce only local variations, leading to limited diversity. In contrast, synthesis-based methods generate entirely new content, greatly enhancing informativeness. However, existing synthesis methods carry the risk of distribution deviations, potentially degrading model performance with out-of-distribution samples. In this paper, we propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model. DistDiff constructs hierarchical prototypes to approximate the real data distribution, optimizing latent data points within diffusion models with hierarchical energy guidance. We demonstrate its capability to generate distribution-consistent samples, significantly improving data expansion tasks. DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data. Furthermore, our approach consistently outperforms existing synthesis-based techniques and demonstrates compatibility with widely adopted transformation-based augmentation methods. Additionally, the expanded dataset exhibits robustness across various architectural frameworks. Our code is available at https://github.com/haoweiz23/DistDiff
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses data scarcity in deep learning: large-scale labeled datasets are costly and time-consuming to obtain. It proposes DistDiff, a training-free data expansion framework that uses a distribution-aware diffusion model to automatically generate new samples consistent with the real data distribution. This improves the quality and quantity of the dataset without retraining the diffusion model, and in turn enhances the performance of deep models trained on the expanded data.

### Main Contributions

1. **A new diffusion-based data expansion algorithm**: DistDiff achieves distribution-consistent data augmentation without retraining.
2. **Hierarchical prototypes to approximate the data distribution**: Class-level and group-level prototypes are used to construct distribution-aware energy guidance that optimizes the diffusion sampling process.
3. **Experimental results**: DistDiff generates high-quality samples, significantly outperforming existing image transformation and synthesis methods and significantly improving the performance of downstream models.

### Method Overview

- **Task Definition**: Given a small-scale training dataset, the data expansion task expands the original dataset \( D_o \) by generating new synthetic samples \( D_s \) to improve the performance of deep learning models.
- **Hierarchical Prototypes to Approximate the Data Distribution**: The original data distribution is captured through class-level and group-level prototypes. The class-level prototype \( p_c \) is obtained by averaging the feature vectors of samples in the same class, and the group-level prototypes \( p_g \) are obtained by clustering.
- **Transforming Data Points**: New samples are generated with a pre-trained large-scale diffusion model, and the latent features are adjusted through a residual multiplicative transformation.
- **Distribution-Aware Diffusion Generation**: During the standard diffusion sampling process, the intermediate denoising steps are optimized through energy guidance so that the generated samples stay consistent with the real data distribution.

### Experimental Results

- **Comparison with Synthesis Methods**: In classification experiments on the Caltech-101 dataset, DistDiff shows significant advantages over existing methods, with an average accuracy improvement of 6.25%.
- **Comparison with Transformation Methods**: In image classification on the Caltech-101 dataset, DistDiff outperforms traditional transformation methods and performs better still when combined with them.

### Conclusion

The DistDiff framework generates high-quality synthetic samples through a distribution-aware diffusion model, effectively mitigating data scarcity and significantly improving the performance of deep learning models.
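
As an illustration of the hierarchical prototypes described in the Method Overview, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes image features have already been extracted as lists of floats. The function names (`build_prototypes`, `energy`), the use of plain Lloyd's k-means for the clustering step, and the Euclidean-distance energy are all assumptions made for this example; the summary does not specify the clustering method, the energy function, or how the energy steers the denoising steps, so that part is not shown.

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            groups[nearest].append(v)
        # Keep the previous centroid when a group receives no points.
        centroids = [
            mean_vector(g) if g else centroids[j] for j, g in enumerate(groups)
        ]
    return centroids


def build_prototypes(features_by_class, num_groups):
    """Class-level prototype = class mean; group-level = cluster centroids."""
    prototypes = {}
    for label, feats in features_by_class.items():
        k = min(num_groups, len(feats))
        prototypes[label] = {
            "class": mean_vector(feats),
            "groups": kmeans(feats, k),
        }
    return prototypes


def energy(feature, proto):
    """Hypothetical energy: distance to the class prototype plus distance
    to the nearest group prototype. Lower means closer to the real data."""
    e_class = math.sqrt(sq_dist(feature, proto["class"]))
    e_group = min(math.sqrt(sq_dist(feature, g)) for g in proto["groups"])
    return e_class + e_group


if __name__ == "__main__":
    # Toy 2-D features for a single class with two visible sub-groups.
    feats = {"cat": [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]}
    protos = build_prototypes(feats, num_groups=2)
    print(protos["cat"]["class"], sorted(protos["cat"]["groups"]))
    print(energy([0.5, 0.0], protos["cat"]), energy([100.0, 0.0], protos["cat"]))
```

In this sketch a candidate sample whose feature lies far from every prototype receives a higher energy than one near the real data, which is the kind of signal a guidance step could use to pull generated samples back toward the original distribution.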