Large-scale Dataset Pruning with Dynamic Uncertainty

Muyang He, Shuo Yang, Tiejun Huang, Bo Zhao
2024-06-14
Abstract: The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them. As a result, the increasing computational cost is becoming unaffordable. In this paper, we investigate how to prune large-scale datasets and thus produce an informative subset for training sophisticated deep models with a negligible performance drop. We propose a simple yet effective dataset pruning method that exploits both prediction uncertainty and training dynamics. We study dataset pruning by measuring the variation of predictions over the whole training process on large-scale datasets, i.e., ImageNet-1K and ImageNet-21K, and advanced models, i.e., Swin Transformer and ConvNeXt. Extensive experimental results indicate that our method outperforms the state of the art and achieves a 25% lossless pruning ratio on both ImageNet-1K and ImageNet-21K. The code and pruned datasets are available at https://github.com/BAAI-DCAI/Dataset-Pruning.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to prune large-scale datasets so that deep models can be trained at lower cost while matching the performance obtained on the full dataset. It proposes a simple and effective dataset pruning method, Dynamic Uncertainty (Dyn-Unc), which selects an informative subset by combining prediction uncertainty with training dynamics, that is, by measuring how the model's predictions on each sample vary over the course of training. The method achieves a 25% lossless pruning ratio on the large-scale datasets ImageNet-1K and ImageNet-21K and outperforms existing dataset pruning methods. Moreover, experimental results show that the subsets selected by Dyn-Unc generalize well to other, unseen model architectures.
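
To make the idea of "variation of predictions during training" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the per-sample score is the standard deviation of the predicted true-class probability inside a sliding window of epochs, averaged over all windows, and that the lowest-scoring samples are the ones pruned. The function names `dynamic_uncertainty` and `prune`, the `window` parameter, and the use of the sample standard deviation (`ddof=1`) are all illustrative assumptions that do not come from the text above.

```python
import numpy as np


def dynamic_uncertainty(probs, window=10):
    """Score each sample by how much its prediction fluctuates during training.

    probs: array of shape (epochs, samples); entry [k, i] is the probability
    the model assigned to the true class of sample i after epoch k.
    window: number of consecutive epochs per sliding window (assumed value).
    """
    probs = np.asarray(probs, dtype=float)
    epochs = probs.shape[0]
    if not 2 <= window <= epochs:
        raise ValueError("window must be between 2 and the number of epochs")
    # Shape (epochs - window + 1, samples, window): one slice per window start.
    windows = np.lib.stride_tricks.sliding_window_view(probs, window, axis=0)
    # Uncertainty inside each window, then averaged over the training run.
    return windows.std(axis=-1, ddof=1).mean(axis=0)


def prune(scores, prune_ratio=0.25):
    """Return sorted indices of the samples kept after dropping the
    lowest-scoring fraction (assumption: low variation = less informative)."""
    scores = np.asarray(scores)
    n_keep = len(scores) - int(round(prune_ratio * len(scores)))
    keep = np.argsort(scores)[::-1][:n_keep]
    return np.sort(keep)
```

In use, one would record the true-class probability of every training sample after each epoch of a single training run, stack those records into the `probs` array, and train later models only on the indices returned by `prune`. The 25% default mirrors the pruning ratio reported above; it is a setting to tune rather than a guarantee of lossless results on other datasets.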