Data-Efficient Generation for Dataset Distillation

Zhe Li, Weitong Zhang, Sarah Cechnicka, Bernhard Kainz
2024-09-06
Abstract: While deep learning techniques have proven successful in image-related tasks, exponentially increasing data storage and computation costs have become a significant challenge. Dataset distillation addresses these challenges by synthesizing only a few images for each class that encapsulate all essential information. Most current methods focus on matching. Their drawbacks are that the synthetic images are not human-readable and that the resulting dataset performs insufficiently on downstream learning tasks. Moreover, the distillation time can quickly get out of bounds when the number of synthetic images per class increases even slightly. To address this, we train a class-conditional latent diffusion model capable of generating realistic synthetic images with labels. Sampling speed reaches several tens of images per second. We demonstrate that models can be effectively trained using only a small set of synthetic images and evaluated on a large real test set. Our approach achieved rank 1 in The First Dataset Distillation Challenge at ECCV 2024 on the CIFAR100 and TinyImageNet datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address several key issues in dataset distillation:

1. **Generating human-readable synthetic images**: Existing methods often produce synthetic images that are not human-readable. The method proposed in this paper generates realistic, human-readable synthetic images.
2. **Improving downstream task performance**: Datasets distilled by most current methods perform insufficiently on downstream learning tasks. The method in this paper improves downstream performance by using a class-conditional latent diffusion model to generate realistic synthetic images with labels.
3. **Reducing computational costs**: With existing methods, the distillation time increases rapidly when the number of synthetic images per class increases even slightly. The proposed method samples several tens of images per second, so the cost of producing more synthetic images does not grow exponentially with their number.

With these improvements, the authors achieved first place in The First Dataset Distillation Challenge at ECCV 2024 on the CIFAR100 and TinyImageNet datasets.
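The evaluation protocol described above (train only on a small, class-balanced synthetic set, then test on a much larger real set) can be illustrated with a toy sketch. This is not the authors' pipeline: a Gaussian sampler stands in for the class-conditional latent diffusion model, a nearest-centroid classifier stands in for the trained network, and every name here (`sample_class_conditional`, `fit_centroids`, `predict`, `evaluate`, and all parameters) is hypothetical.

```python
import numpy as np


def sample_class_conditional(class_means, per_class, noise, rng):
    """Toy stand-in for a class-conditional generator (an assumption,
    not the paper's diffusion model): draws `per_class` labeled samples
    around each class mean."""
    labels = np.repeat(np.arange(len(class_means)), per_class)
    shape = (labels.size, class_means.shape[1])
    feats = class_means[labels] + noise * rng.standard_normal(shape)
    return feats, labels


def fit_centroids(x, y, num_classes):
    """Train a nearest-centroid classifier on the synthetic set only."""
    return np.stack([x[y == c].mean(axis=0) for c in range(num_classes)])


def predict(centroids, x):
    """Assign each sample to the class with the closest centroid."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def evaluate(ipc=10, num_classes=5, dim=16, test_per_class=200, seed=0):
    """Train on `ipc` synthetic samples per class, test on a larger held-out set."""
    rng = np.random.default_rng(seed)
    class_means = 4.0 * rng.standard_normal((num_classes, dim))

    # Small synthetic training set: images-per-class (IPC) times num_classes.
    x_syn, y_syn = sample_class_conditional(class_means, ipc, 1.0, rng)
    # Larger held-out set playing the role of the real test data.
    x_test, y_test = sample_class_conditional(class_means, test_per_class, 1.0, rng)

    centroids = fit_centroids(x_syn, y_syn, num_classes)
    accuracy = float((predict(centroids, x_test) == y_test).mean())
    return x_syn.shape, accuracy


if __name__ == "__main__":
    shape, acc = evaluate()
    print("synthetic set shape:", shape, "test accuracy:", acc)
```

The point of the sketch is the data flow only: the size of the training set is fixed by the images-per-class budget, while the test set can be arbitrarily large and never influences training.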