BanditPAM++: Faster $k$-medoids Clustering

Mo Tiwari, Ryan Kang, Donghyun Lee, Sebastian Thrun, Chris Piech, Ilan Shomorony, Martin Jinye Zhang
2023-10-29
Abstract: Clustering is a fundamental task in data science with wide-ranging applications. In $k$-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used; these features allow for greater interpretability of the cluster centers and the clustering of exotic objects, respectively. $k$-medoids clustering has recently grown in popularity due to the discovery of more efficient $k$-medoids algorithms. In particular, recent research has proposed BanditPAM, a randomized $k$-medoids algorithm with state-of-the-art complexity and clustering accuracy. In this paper, we present BanditPAM++, which accelerates BanditPAM via two algorithmic improvements, and is $O(k)$ faster than BanditPAM in complexity and substantially faster than BanditPAM in wall-clock runtime. First, we demonstrate that BanditPAM has a special structure that allows the reuse of clustering information $\textit{within}$ each iteration. Second, we demonstrate that BanditPAM has additional structure that permits the reuse of information $\textit{across}$ different iterations. These observations inspire our proposed algorithm, BanditPAM++, which returns the same clustering solutions as BanditPAM but often several times faster. For example, on the CIFAR10 dataset, BanditPAM++ returns the same results as BanditPAM but runs over 10$\times$ faster. Finally, we provide a high-performance C++ implementation of BanditPAM++, callable from Python and R, that may be of interest to practitioners at https://github.com/motiwari/BanditPAM. Auxiliary code to reproduce all of our experiments via a one-line script is available at https://github.com/ThrunGroup/BanditPAM_plusplus_experiments.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the inefficiency of existing $k$-medoids clustering algorithms on large datasets. It proposes BanditPAM++, an improved version of BanditPAM that is more computationally efficient while returning the same clustering results as the original algorithm. The main contributions are:

1. **Virtual Arms (VA)**: reuses distance computations within each iteration, reducing the complexity of each SWAP step by a factor of $O(k)$.
2. **Permutation-Invariant Caching (PIC)**: reuses computation results across different iterations, further reducing wall-clock runtime.

Together, these improvements make BanditPAM++ faster than BanditPAM on large-scale datasets. On CIFAR10, for example, BanditPAM++ runs over 10 times faster than BanditPAM while returning the same results. The paper also provides a high-performance C++ implementation that can be called from Python and R.
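To make the within-iteration reuse behind Virtual Arms more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard PAM SWAP bookkeeping in which each reference point knows its nearest and second-nearest medoid distances, so that a single new distance between a candidate and a reference point yields the loss change for swapping the candidate with every one of the $k$ medoids. All function names (`l1`, `loss`, `nearest_two`, `swap_deltas`) are hypothetical, and the randomized sampling of reference points that BanditPAM relies on is omitted; here every reference point is evaluated exactly.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def l1(a: Point, b: Point) -> float:
    """Manhattan distance; any dissimilarity function could be used instead."""
    return sum(abs(x - y) for x, y in zip(a, b))


def loss(data: List[Point], medoids: List[int], dist: Metric) -> float:
    """k-medoids objective: total distance from each point to its nearest medoid."""
    return sum(min(dist(p, data[m]) for m in medoids) for p in data)


def nearest_two(point: Point, data: List[Point], medoids: List[int],
                dist: Metric) -> Tuple[int, float, float]:
    """Return (position of nearest medoid, nearest distance, second-nearest distance)."""
    ranked = sorted((dist(point, data[m]), pos) for pos, m in enumerate(medoids))
    d1, pos1 = ranked[0]
    d2 = ranked[1][0] if len(ranked) > 1 else float("inf")
    return pos1, d1, d2


def swap_deltas(candidate: int, reference: int, data: List[Point],
                medoids: List[int], dist: Metric) -> List[float]:
    """Loss change on one reference point for swapping each medoid with `candidate`.

    Only one new distance, d(candidate, reference), is computed here;
    it is reused for all k possible swaps.
    """
    # In a real implementation these values would be cached per reference point.
    pos1, d1, d2 = nearest_two(data[reference], data, medoids, dist)
    d_new = dist(data[candidate], data[reference])
    deltas = []
    for pos in range(len(medoids)):
        # If the removed medoid was the nearest one, the fallback is the second nearest.
        remaining = d2 if pos == pos1 else d1
        deltas.append(min(d_new, remaining) - d1)
    return deltas


# Usage: total loss change of swapping candidate 4 with each current medoid.
data = [(0.0,), (1.0,), (5.0,), (6.0,), (10.0,)]
medoids = [0, 2]
per_reference = [swap_deltas(4, j, data, medoids, l1) for j in range(len(data))]
totals = [sum(column) for column in zip(*per_reference)]
print(totals)
```

Summing the per-reference deltas over all points reproduces the exact change in the objective for each of the $k$ swaps. A bandit-style algorithm would instead estimate these sums from sampled reference points, which is where sharing one distance computation across $k$ swaps would pay off.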