BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges

Hoyong Choi, Nohyun Ki, Hye Won Chung
2024-06-05
Abstract: Data subset selection aims to find a smaller yet informative subset of a large dataset that can approximate the full-dataset training, addressing challenges associated with training neural networks on large-scale datasets. However, existing methods tend to specialize in either high or low selection ratio regimes, lacking a universal approach that consistently achieves competitive performance across a broad range of selection ratios. We introduce a universal and efficient data subset selection method, Best Window Selection (BWS), by proposing a method to choose the best window subset from samples ordered based on their difficulty scores. This approach offers flexibility by allowing the choice of window intervals that span from easy to difficult samples. Furthermore, we provide an efficient mechanism for selecting the best window subset by evaluating its quality using kernel ridge regression. Our experimental results demonstrate the superior performance of BWS compared to other baselines across a broad range of selection ratios over datasets, including CIFAR-10/100 and ImageNet, and the scenarios involving training from random initialization or fine-tuning of pre-trained models.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the challenges of training neural networks on large-scale datasets, such as high computational cost, storage requirements, and privacy issues, by proposing a new data subset selection method, Best Window Selection (BWS). The core objective is a general and efficient data subset selection method that remains competitive across a wide range of selection ratios.

### Research Background and Problem Definition

Existing data subset selection methods typically perform well at either high or low selection ratios, but no method has yet performed well across the entire range. For example:

- **Score-based selection methods** (such as sorting samples by difficulty) tend to perform close to full-dataset training at high selection ratios, but their performance drops significantly as the selection ratio decreases.
- **Optimization-based selection methods** (such as optimizing for the subset that best approximates the full dataset's loss gradient) perform well at low selection ratios, but their performance improves only slowly as the selection ratio increases.

### Main Contributions

The BWS method proposed in the paper aims to address these issues. Specifically:

1. **Introduction of the BWS Method**: BWS first sorts the samples by their difficulty scores and then selects a "window subset" from the sorted list. A window subset consists of consecutively ranked samples, and varying the window's starting point moves it from easy samples to difficult ones. In this way, BWS can flexibly adapt to different selection ratio requirements.
2. **Efficient Evaluation Mechanism**: To select the best window subset efficiently, BWS uses Kernel Ridge Regression (KRR) to evaluate the quality of each window subset. This avoids training a full model to assess each candidate subset, which greatly improves efficiency.
3. **Extensive Applicability Verification**: The paper experimentally verifies the effectiveness of BWS on the CIFAR-10/100 and ImageNet datasets, demonstrating that it can outperform other baseline methods across selection ratios from 1% to 90%. At low selection ratios in particular, BWS brings significant performance improvements over score-based methods, while at high selection ratios it remains competitive.
4. **Theoretical Analysis and Empirical Support**: The paper also provides a theoretical analysis showing how the subset characteristics required change with the selection ratio, and supports these results empirically.

### Conclusion

In summary, the paper proposes BWS to overcome the limitations of existing data subset selection methods and achieve robust performance across a wide range of selection ratios, from extremely low to extremely high. By combining a flexible window subset selection strategy with an efficient evaluation mechanism, BWS offers a comprehensive and effective solution for data subset selection.
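
The window-selection idea summarized above (sort by difficulty, slide a fixed-size window, score each window with kernel ridge regression) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the linear kernel, the ridge parameter, the stride between window starts, and the use of accuracy on the full dataset as the quality measure are all assumptions made for illustration. `feats` stands for any fixed feature representation of the samples, and `scores` for any per-sample difficulty score.

```python
import numpy as np


def krr_accuracy(feats, labels, subset, num_classes, ridge=1.0):
    """Fit kernel ridge regression on `subset` and return accuracy on all samples.

    Assumption: a linear kernel on precomputed features and one-hot targets.
    """
    X = feats[subset]                              # (m, d) features of the window
    Y = np.eye(num_classes)[labels[subset]]        # (m, c) one-hot targets
    K = X @ X.T                                    # (m, m) linear kernel matrix
    alpha = np.linalg.solve(K + ridge * np.eye(len(subset)), Y)
    preds = (feats @ X.T @ alpha).argmax(axis=1)   # predict for every sample
    return float((preds == labels).mean())


def best_window(feats, labels, scores, ratio, num_classes, step=0.05, ridge=1.0):
    """Return (indices, start, quality) of the best contiguous window.

    Samples are ordered from most to least difficult (assumed convention:
    a higher score means a harder sample). Candidate windows start every
    `step` fraction of the dataset (hypothetical parameter).
    """
    n = len(labels)
    size = max(1, int(round(ratio * n)))           # window length from the ratio
    stride = max(1, int(round(step * n)))          # gap between candidate starts
    order = np.argsort(-scores)                    # hardest samples first

    best_acc, best_start = -1.0, 0
    for start in range(0, n - size + 1, stride):
        window = order[start:start + size]
        acc = krr_accuracy(feats, labels, window, num_classes, ridge)
        if acc > best_acc:
            best_acc, best_start = acc, start
    return order[best_start:best_start + size], best_start, best_acc
```

For example, calling `best_window(feats, labels, scores, ratio=0.1, num_classes=10)` would return the indices of a 10% window, which could then be used to train the actual model. Each candidate costs only one linear solve of window size, which is the source of the efficiency gain over training a network per candidate.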