Data-Efficient Generation for Dataset Distillation

Zhe Li,Weitong Zhang,Sarah Cechnicka,Bernhard Kainz

2024-09-06

Abstract: While deep learning techniques have proven successful in image-related tasks, the exponentially increased data storage and computation costs become a significant challenge. Dataset distillation addresses these challenges by synthesizing only a few images for each class that encapsulate all essential information. Most current methods focus on matching. The problems lie in the synthetic images not being human-readable and the dataset performance being insufficient for downstream learning tasks. Moreover, the distillation time can quickly get out of bounds when the number of synthetic images per class increases even slightly. To address this, we train a class conditional latent diffusion model capable of generating realistic synthetic images with labels. The sampling time can be reduced to several tens of images per second. We demonstrate that models can be effectively trained using only a small set of synthetic images and evaluated on a large real test set. Our approach achieved rank \(1\) in The First Dataset Distillation Challenge at ECCV 2024 on the CIFAR100 and TinyImageNet datasets.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address several key issues in dataset distillation:

1. **Generating highly readable synthetic images**: Existing methods often produce synthetic images that are not readable. The method proposed in this paper can generate high-quality and human-readable synthetic images.
2. **Improving downstream task performance**: Most current methods fail to support downstream learning tasks when generating a small number of synthetic images. The method in this paper improves the model's performance on downstream tasks by using a conditional diffusion model to generate high-fidelity synthetic images.
3. **Reducing computational costs**: The distillation time increases rapidly when the number of synthetic images per category slightly increases. The proposed method can generate a large number of synthetic images in a short time, and the computational cost does not grow exponentially with the number of synthetic images.

With these improvements, the authors achieved first place in The First Dataset Distillation Challenge at ECCV 2024, particularly excelling on the CIFAR100 and TinyImageNet datasets.
