Abstract: Dataset distillation aims to compress information from a large-scale original dataset into a new compact dataset while preserving as much of the original data's informational essence as possible. Previous studies have predominantly concentrated on aligning intermediate statistics between the original and distilled data, such as weight trajectories, features, gradients, and BatchNorm statistics. In this work, we address this task through the new lens of model informativeness in the compression stage, i.e., pretraining on the original dataset. We observe that with the prior state-of-the-art SRe$^2$L, as model sizes increase, it becomes increasingly challenging for supervised pretrained models to recover learned information during data synthesis, because the channel-wise mean and variance inside the model flatten and become less informative. We further notice that larger variances in BN statistics from self-supervised models enable larger loss signals to update the recovered data by gradients, yielding more informativeness during synthesis. Building on this observation, we introduce SC-DD, a simple yet effective Self-supervised Compression framework for Dataset Distillation that facilitates diverse information compression and recovery compared to traditional supervised learning schemes and further reaps the potential of large pretrained models with enhanced capabilities. Extensive experiments are conducted on CIFAR-100, Tiny-ImageNet and ImageNet-1K datasets to demonstrate the superiority of our proposed approach. When employing larger models, the proposed SC-DD outperforms all previous state-of-the-art supervised dataset distillation methods, such as SRe$^2$L, MTT, TESLA, DC, CAFE, etc., by large margins under the same recovery and post-training budgets. Code is available at

What problem does this paper attempt to address?

This paper addresses dataset distillation: compressing the information in a large original dataset into a compact new dataset while preserving the original's essential information. Previous research has mainly focused on aligning intermediate statistics between the original and distilled data, such as weight trajectories, features, gradients, and BatchNorm (BN) statistics. This paper instead examines how the informativeness of the model obtained in the pre-training (compression) phase on the original dataset affects distillation. The study finds that as model size increases, supervised pre-trained models struggle to recover learned information during data synthesis, because the channel-wise means and variances inside the model flatten and therefore carry less information. In contrast, the BN statistics of self-supervised models have larger variances, which produce larger loss signals for the gradient updates of the recovered data and make that data more informative. The paper proposes SC-DD (Self-supervised Compression for Dataset Distillation), a simple but effective self-supervised compression framework that supports more diverse information compression and recovery than traditional supervised learning schemes and makes better use of the enhanced capabilities of large pre-trained models. Experiments on CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that SC-DD outperforms previous supervised dataset distillation methods, especially when larger models are used. The paper also argues that the intermediate distributions learned by self-supervised pre-training are more informative for dataset distillation, and it highlights the importance of model selection and of the mismatch in BN statistics distributions that arises from the pre-training phase.
As a result, SC-DD maintains or improves performance as model size grows, showing a positive correlation between model size and performance in the post-training phase.
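
To make the role of BN statistics during data synthesis more concrete, the following is a minimal Python sketch of a BN-statistics alignment loss of the kind the summary refers to. It is an illustration under stated assumptions, not the paper's implementation: the function names (`channel_stats`, `bn_alignment_loss`, `spread`), the choice of an L2 distance on means plus an L2 distance on variances, and all numbers are hypothetical. The `spread` helper is one simple way to quantify how "flat" a set of stored channel statistics is.

```python
import math


def channel_stats(batch):
    """Per-channel mean and (biased) variance over a batch.

    batch[n][c] is the list of activations of sample n in channel c.
    """
    num_channels = len(batch[0])
    means, variances = [], []
    for c in range(num_channels):
        values = [v for sample in batch for v in sample[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        variances.append(var)
    return means, variances


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bn_alignment_loss(batch, running_mean, running_var):
    """Assumed loss form: distance between the batch statistics of the
    synthesized data and the statistics stored in one BN layer."""
    means, variances = channel_stats(batch)
    return l2_distance(means, running_mean) + l2_distance(variances, running_var)


def spread(values):
    """Variance of a statistic across channels; near zero means 'flat'."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Toy activations: 2 samples, 2 channels, 2 values per channel (made up).
batch = [
    [[0.0, 2.0], [1.0, 1.0]],
    [[2.0, 4.0], [1.0, 3.0]],
]

# Loss against stored statistics that the batch already matches.
matched = bn_alignment_loss(batch, [2.0, 1.5], [2.0, 0.75])

# Loss against hypothetical "flat" stored statistics (identical channels).
flat = bn_alignment_loss(batch, [0.0, 0.0], [1.0, 1.0])
```

In this toy setup, `matched` is zero because the batch statistics equal the stored ones, while `spread([0.0, 0.0])` is zero for the flat target, meaning its channels are indistinguishable from one another. In an actual synthesis pipeline the loss would be summed over all BN layers and differentiated with respect to the synthesized images; that part is omitted here.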

Self-supervised Dataset Distillation: A Good Compression Is All You Need

DCCD: Reducing Neural Network Redundancy Via Distillation

Dataset Distillation via Curriculum Data Synthesis in Large Data Era

Dataset Distillation: A Comprehensive Review

A Comprehensive Survey of Dataset Distillation

Curriculum Dataset Distillation

Accelerating Dataset Distillation Via Model Augmentation

Diffusion-Augmented Coreset Expansion for Scalable Dataset Distillation

Self-Supervised Dataset Distillation for Transfer Learning

Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks

Distilling Datasets Into Less Than One Image

Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization

Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective

Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation

Distributed Boosting: an Enhancing Method on Dataset Distillation

What is Dataset Distillation Learning?

Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning

Importance-Aware Adaptive Dataset Distillation

Dataset Distillation with Channel Efficient Process

Data-to-Model Distillation: Data-Efficient Learning Framework