Dataset Distillation via Curriculum Data Synthesis in Large Data Era

Zeyuan Yin, Zhiqiang Shen
2024-11-25
Abstract: Dataset distillation, or condensation, aims to generate a smaller but representative subset from a large dataset, so that a model can be trained more efficiently while still achieving decent performance when evaluated on the original test data distribution. Previous decoupled methods such as SRe$^2$L simply use a unified gradient update scheme for synthesizing data from Gaussian noise. We observe, however, that the first several update iterations determine the final outline of the synthesized image, so an improper gradient update strategy may dramatically affect the final generation quality. To address this, we introduce a simple yet effective global-to-local gradient refinement approach enabled by curriculum data augmentation ($\texttt{CDA}$) during data synthesis. The proposed framework achieves the highest currently published accuracy on both large-scale ImageNet-1K and ImageNet-21K, with 63.2% under IPC (Images Per Class) 50 and 36.1% under IPC 20, respectively, using a regular input resolution of 224$\times$224 with faster convergence and less synthesis time. It outperforms current state-of-the-art methods such as SRe$^2$L, TESLA, and MTT by more than 4% Top-1 accuracy on ImageNet-1K/21K and, for the first time, reduces the gap to its full-data training counterparts to less than 15% absolute. Moreover, this work represents the inaugural success in dataset distillation on the larger-scale ImageNet-21K dataset under the standard 224$\times$224 resolution. Our code and distilled ImageNet-21K dataset (20 IPC, 2K recovery budget) are available at: https://github.com/VILA-Lab/SRe2L/tree/main/CDA
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper asks how, in the large-data era, a smaller yet representative subset can be generated from a large-scale dataset so that a model trains more efficiently and still performs well on the original test data distribution. Specifically, it focuses on dataset distillation (also called dataset condensation): synthesizing a compact dataset on which a model trained from scratch comes as close as possible to the accuracy of a model trained on the complete dataset.
The paper points out that existing decoupled methods such as SRe$^2$L apply a single, unified gradient update scheme throughout data synthesis. Because the first few update iterations determine the overall outline of each synthesized image, an improper update strategy in that phase can seriously degrade the quality of the final data. To solve this problem, the paper proposes a global-to-local gradient refinement method based on Curriculum Data Augmentation (CDA), which gradually increases the difficulty of the cropped region during synthesis, moving from simple overall structure to complex local detail, in order to improve the quality of the synthesized data.
The main contributions of the paper include:
1. Proposing a new Curriculum Data Augmentation (CDA) framework, which distills large-scale datasets through a global-to-local gradient update strategy.
2. Successfully distilling the larger-scale ImageNet-21K dataset at the standard 224$\times$224 resolution for the first time, and narrowing the performance gap to models trained on the full data to less than 15% absolute.
3. Conducting extensive experiments on multiple datasets, including CIFAR-100, Tiny-ImageNet, ImageNet-1K, and ImageNet-21K, that demonstrate the effectiveness of the proposed method.
Through these contributions, the paper not only advances dataset distillation but also offers a new approach to handling large-scale datasets, one that helps reduce storage and compute requirements while also helping to protect data privacy.
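To make the global-to-local idea more concrete, the following is a minimal sketch of a crop-scale curriculum, not the authors' implementation. It assumes the curriculum is expressed as a lower bound on the crop area fraction that starts near the full image and decays as synthesis proceeds. The function names (`min_crop_scale`, `sample_crop_scale`), the bounds `1.0` and `0.08`, and the linear and cosine schedule shapes are all illustrative assumptions; the summary above does not specify them.

```python
import math
import random


def min_crop_scale(step, total_steps, start=1.0, end=0.08, schedule="cosine"):
    """Lower bound on the crop area fraction at a given synthesis step.

    Early steps return a bound close to `start` (near-global crops);
    late steps return a bound close to `end` (small local crops allowed).
    `start`, `end`, and both schedule shapes are illustrative assumptions.
    """
    if total_steps <= 1:
        return end
    # Fraction of the synthesis run completed, clamped to [0, 1].
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    if schedule == "linear":
        return start + (end - start) * progress
    if schedule == "cosine":
        return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")


def sample_crop_scale(step, total_steps, rng=random, **kwargs):
    """Draw a crop area fraction uniformly from [min_crop_scale, 1.0]."""
    lower = min_crop_scale(step, total_steps, **kwargs)
    return rng.uniform(lower, 1.0)


if __name__ == "__main__":
    total = 1000  # hypothetical number of synthesis iterations
    for step in (0, 250, 500, 750, 999):
        bound = min_crop_scale(step, total)
        print(f"step {step:4d}: crop area fraction drawn from [{bound:.3f}, 1.0]")
```

In a PyTorch pipeline, a bound like this could be passed per iteration as `scale=(lower, 1.0)` to `torchvision.transforms.RandomResizedCrop`, so that early gradient updates see almost the whole image and later updates refine local regions. That wiring is likewise an assumption about how such a schedule might be used.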