Data Pruning via Separability, Integrity, and Model Uncertainty-Aware Importance Sampling

Steven Grosz, Rui Zhao, Rajeev Ranjan, Hongcheng Wang, Manoj Aggarwal, Gerard Medioni, Anil Jain
2024-09-21
Abstract: This paper improves upon existing data pruning methods for image classification by introducing a novel pruning metric and pruning procedure based on importance sampling. The proposed pruning metric explicitly accounts for data separability, data integrity, and model uncertainty, while the sampling procedure is adaptive to the pruning ratio and considers both intra-class and inter-class separation to further enhance the effectiveness of pruning. Furthermore, the sampling method can readily be applied to other pruning metrics to improve their performance. Overall, the proposed approach scales well to high pruning ratios and generalizes better across different classification models, as demonstrated by experiments on four benchmark datasets, including the fine-grained classification scenario.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses three shortcomings of existing data pruning methods for image classification:

1. **Improving Robustness**: Existing data pruning methods are not robust to noisy data. For example, some methods rely solely on prediction error to measure sample importance, which makes it hard to tell difficult but useful samples apart from noisy ones.
2. **Mitigating Class Imbalance**: Existing pruning strategies may exacerbate class imbalance because they often sample strictly by sample difficulty, ignoring the inherent differences in difficulty between classes.
3. **Adaptability Issues**: Existing methods lack flexibility in deciding whether to prune easy or difficult samples. The right choice depends on the initial data volume and the chosen pruning ratio.

To address these issues, the authors propose a new data pruning metric, SIM (Separability, Integrity, and Model Uncertainty), and an improved sampling method, SIMS (SIM with Importance Sampling). The metric accounts for the separability, integrity, and model uncertainty of the data, and the adaptive importance sampling strategy further improves pruning effectiveness. The approach performs particularly well at high pruning ratios and generalizes better across models. Experimental results on four benchmark datasets show that it outperforms several existing mainstream data pruning methods.
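To make the idea of ratio-adaptive, class-aware importance sampling more concrete, here is a minimal Python sketch. It is not the authors' SIM metric or SIMS procedure, whose exact formulas are not given in this summary. Everything in it is an assumption made for illustration: the function name `prune_by_importance_sampling`, the per-sample difficulty `scores` in [0, 1] standing in for a pruning metric, the linear weighting that leans toward easy samples when few are kept and toward hard samples when many are kept, and the per-class quota used to avoid worsening class imbalance.

```python
import random
from collections import defaultdict


def prune_by_importance_sampling(scores, labels, keep_ratio, seed=0):
    """Return sorted indices of the samples kept after pruning.

    scores     : hypothetical difficulty scores in [0, 1] (1 = hardest).
    labels     : class label of each sample.
    keep_ratio : fraction of each class to keep, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)

    # Group sample indices by class so every class keeps the same fraction.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for indices in by_class.values():
        n_keep = max(1, round(keep_ratio * len(indices)))
        keyed = []
        for idx in indices:
            s = scores[idx]
            # Assumed heuristic: a small keep_ratio favours easy samples,
            # a large keep_ratio favours hard samples.
            weight = keep_ratio * s + (1.0 - keep_ratio) * (1.0 - s) + 1e-6
            # Weighted sampling without replacement (Efraimidis-Spirakis):
            # draw u in (0, 1], rank by u ** (1 / weight), keep the top keys.
            u = 1.0 - rng.random()
            keyed.append((u ** (1.0 / weight), idx))
        keyed.sort(reverse=True)
        kept.extend(idx for _, idx in keyed[:n_keep])
    return sorted(kept)


if __name__ == "__main__":
    scores = [i / 9 for i in range(10)] * 2   # two classes, ten samples each
    labels = [0] * 10 + [1] * 10
    print(prune_by_importance_sampling(scores, labels, keep_ratio=0.3))
```

Because sampling is done per class, each class retains the same fraction of its samples regardless of how difficult that class is overall. Sampling by weight rather than taking a hard top-k cut keeps some diversity in the retained subset, which is one common motivation for importance sampling in this setting.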