Accelerating k-Means Clustering with Cover Trees

Andreas Lang, Erich Schubert
DOI: https://doi.org/10.1007/978-3-031-46994-7_13
2024-10-19
Abstract: The k-means clustering algorithm is a popular algorithm that partitions data into k clusters. There are many improvements to accelerate the standard algorithm. Most current research employs upper and lower bounds on point-to-cluster distances and the triangle inequality to reduce the number of distance computations, with only arrays as underlying data structures. These approaches cannot exploit that nearby points are likely assigned to the same cluster. We propose a new k-means algorithm based on the cover tree index, that has relatively low overhead and performs well, for a wider parameter range, than previous approaches based on the k-d tree. By combining this with upper and lower bounds, as in state-of-the-art approaches, we obtain a hybrid algorithm that combines the benefits of tree aggregation and bounds-based filtering.
Machine Learning
### What problem does this paper attempt to solve?

This paper aims to speed up the k-means clustering algorithm. Specifically, the authors propose a new k-means algorithm built on the cover tree index. Existing accelerated methods mainly reduce the number of distance computations by using upper and lower bounds and the triangle inequality, but they rely only on arrays as the underlying data structure and therefore cannot exploit the fact that neighboring points are likely to be assigned to the same cluster.

#### Main problems and solutions

1. **Limitations of existing methods**:
   - Most existing acceleration methods are either based on k-d trees or apply the triangle inequality directly for pruning, and they perform poorly on high-dimensional data or under specific data distributions.
   - The array-based methods do not exploit the relationships between neighboring points in the data set, which limits the achievable speedup.
2. **The proposed new method**:
   - The authors introduce the cover tree, a hierarchical structure of nested balls that organizes the data and avoids unnecessary distance computations.
   - By combining the cover tree with triangle-inequality pruning, the new method performs well over a wider range of parameters, especially on high-dimensional data.
3. **Specific implementation**:
   - Index the data with a cover tree so that an entire subset of points can be assigned to a cluster center at once, reducing the number of distance computations per iteration.
   - Use the cover tree in the early iterations and switch to an algorithm based on stored bounds (such as Hamerly or Shallot) in the later iterations to further improve performance.
4. **Experimental verification**:
   - The paper runs experiments on multiple real-world data sets. The results show that the new method reduces both the number of distance computations and the running time.

#### Summary

The main goal of the paper is to accelerate k-means clustering by combining the cover tree index with bounds-based filtering of distance computations, especially for high-dimensional and large-scale data sets. The method improves the overall performance of the algorithm and remains applicable across different data distributions.
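
The idea of assigning a whole subset to one center can be illustrated with a small sketch. It is not the authors' implementation: the function name `assign_ball` and the representation of a tree node as a routing point plus a covering radius are assumptions made for illustration. The test itself follows from the triangle inequality: every point in the ball is at most `d_nearest + radius` from the nearest center and at least `d_second - radius` from any other center, so if the first value does not exceed the second, no point in the ball can be closer to a different center.

```python
import math

def assign_ball(routing_point, radius, centers):
    """Return the index of the center that owns every point within `radius`
    of `routing_point`, or None if the ball may straddle a cluster boundary.

    Hypothetical helper for illustration; assumes at least two centers.
    """
    dists = [math.dist(routing_point, c) for c in centers]
    order = sorted(range(len(centers)), key=dists.__getitem__)
    nearest, second = order[0], order[1]
    # Upper bound on the distance to the nearest center versus
    # lower bound on the distance to the second-nearest center.
    if dists[nearest] + radius <= dists[second] - radius:
        return nearest
    return None

centers = [(0.0, 0.0), (10.0, 0.0)]
tight = assign_ball((1.0, 0.0), 0.5, centers)  # ball lies well inside cluster 0
wide = assign_ball((4.5, 0.0), 1.0, centers)   # ball crosses the boundary at x = 5
```

When the test fails, a tree-based algorithm would typically descend into the node's children, whose smaller radii make the test more likely to succeed; that recursion is omitted here.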