A Weighted K-Center Algorithm for Data Subset Selection

Srikumar Ramalingam, Pranjal Awasthi, Sanjiv Kumar
2023-12-17
Abstract: The success of deep learning hinges on enormous data and large models, which require labor-intensive annotations and heavy computation costs. Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data, which can then be used to produce similar models as the ones trained with full data. Two prior methods are shown to achieve impressive results: (1) margin sampling that focuses on selecting points with high uncertainty, and (2) core-sets or clustering methods such as k-center for informative and diverse subsets. We are not aware of any work that combines these methods in a principled manner. To this end, we develop a novel and efficient factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions. To handle large datasets, we show a parallel algorithm to run on multiple machines with approximation guarantees. The proposed algorithm achieves similar or better performance compared to other strong baselines on vision datasets such as CIFAR-10, CIFAR-100, and ImageNet.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to select a small subset of a large training dataset so that models trained on the subset perform similarly to those trained on the full dataset. The authors propose the Weighted K-Center Algorithm, which combines two methods previously shown to be effective: margin sampling and core-set selection based on k-center clustering. The algorithm selects a subset by minimizing the weighted sum of the k-center objective and the margin sampling objective, and it handles large datasets by running in parallel on multiple machines.

The key contributions of the paper are:
1. An efficient, novel algorithm that minimizes the weighted sum of the k-center and margin sampling objective functions.
2. A proof that the proposed Weighted K-Center Algorithm has a constant factor 3 approximation guarantee.
3. An alternative parallel version of the algorithm that runs on multiple machines, with a proof that it has a constant factor 14 approximation guarantee.
4. Experimental results on standard image datasets (CIFAR-10, CIFAR-100, and ImageNet) showing that the proposed algorithm achieves similar or better performance compared to other strong baselines.

In summary, the work seeks a more effective data subset selection strategy by combining uncertainty and diversity, thereby reducing the cost of manual labeling and the demand for computational resources.
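
To make the two ingredients concrete, the following Python sketch computes margin scores from class probabilities, evaluates the k-center radius of a chosen subset, and picks points with a simple greedy rule that mixes distance and uncertainty. It is an illustration only, not the paper's factor 3 approximation algorithm: the function names, the weight `lam`, the distance normalization, and the `1 - margin` uncertainty score are all assumptions made here for the example.

```python
import numpy as np


def margin_scores(probs):
    """Margin = top-1 minus top-2 class probability (small margin = uncertain)."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]


def kcenter_radius(X, selected):
    """k-center objective: largest distance from any point to its nearest selected point."""
    d = np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=2)
    return d.min(axis=1).max()


def greedy_weighted_select(X, probs, k, lam=1.0):
    """Hypothetical greedy heuristic (NOT the paper's algorithm).

    Each step picks the point maximizing
        normalized distance to the current subset + lam * uncertainty,
    i.e. farthest-first traversal biased toward low-margin points.
    """
    uncertainty = 1.0 - margin_scores(probs)  # assumed uncertainty score
    selected = [int(np.argmax(uncertainty))]  # start from the most uncertain point
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < k:
        scale = min_dist.max() or 1.0  # avoid division by zero
        score = min_dist / scale + lam * uncertainty
        score[selected] = -np.inf  # never re-pick a selected point
        nxt = int(np.argmax(score))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

Setting `lam=0` reduces the selection steps after the first point to plain farthest-first traversal, while a large `lam` makes the choice depend almost entirely on uncertainty; the weighted sum interpolates between the diversity and uncertainty criteria described above.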