Coresets for Kernel Clustering

Shaofeng H.-C. Jiang, Robert Krauthgamer, Jianing Lou, Yubo Zhang
2024-04-06
Abstract: We devise coresets for kernel $k$-Means with a general kernel, and use them to obtain new, more efficient, algorithms. Kernel $k$-Means has superior clustering capability compared to classical $k$-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs.
Computer Science
What problem does this paper attempt to address?
This paper addresses the computational cost of clustering with kernel methods, specifically kernel k-means. Kernel k-means clusters non-linearly separable data better than classical k-means, but it is considerably more expensive to compute. To address this, the paper constructs a coreset for kernel k-means with a general kernel function. A coreset is a small weighted dataset whose clustering cost is within a factor of (1±ε) of the cost of the original dataset for every candidate set of centers.

The main contribution is a coreset of size poly(k/ε) for kernel k-means that can be constructed in nearly linear time. This coreset is more general than previous results and significantly improves scalability and efficiency. The paper also shows how to use the coreset to obtain fast approximation algorithms and streaming algorithms. In addition, the coreset is used to accelerate kernel k-means++ (the kernelized version of k-means++), which is then applied to spectral clustering, yielding a significant speedup and better asymptotic growth than the baseline that does not use the coreset.

The paper compares its method with other techniques, including uniform sampling, dimensionality reduction, and other kernel-based approximation algorithms, and points out their shortcomings in coreset size, computation time, and guaranteed accuracy. Experiments on various datasets and kernel functions show that the coreset reduces the number of required points while keeping the error low, demonstrating its efficiency and accuracy in practice.
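To make the coreset guarantee concrete, the following Python sketch evaluates the weighted kernel k-means cost using only kernel evaluations, then compares the cost of a full dataset with the cost of a small weighted subset for the same centers. It is not the paper's construction: the subset here is a plain uniform sample reweighted by n/m, which is only the baseline mentioned above and carries no worst-case (1±ε) guarantee. The names `rbf_kernel`, `linear_kernel`, `kernel_kmeans_cost`, and `demo`, the representation of centers as coefficient vectors over reference points, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A * A).sum(axis=1)[:, None] - 2.0 * (A @ B.T) + (B * B).sum(axis=1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))


def linear_kernel(A, B):
    """Linear kernel K[i, j] = <A[i], B[j]> (feature map is the identity)."""
    return A @ B.T


def kernel_kmeans_cost(X, w, Z, coef, kernel):
    """Weighted kernel k-means cost of points X (weights w) for k centers.

    Center i is the feature-space point sum_j coef[i, j] * phi(Z[j]), so
    ||phi(x) - c_i||^2 = K(x, x) - 2 <phi(x), c_i> + ||c_i||^2
    can be computed from kernel evaluations alone.
    """
    # K(x, x) for every point (a simple loop keeps the kernel fully generic).
    k_xx = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X])
    cross = kernel(X, Z) @ coef.T                              # n x k inner products
    norms = np.einsum("ij,jl,il->i", coef, kernel(Z, Z), coef)  # k squared norms
    d2 = k_xx[:, None] - 2.0 * cross + norms[None, :]
    # Each point pays the squared distance to its nearest center.
    return float(np.sum(w * np.maximum(d2, 0.0).min(axis=1)))


def demo(n=2000, m=200, seed=0):
    """Compare the cost on all n points with the cost on a reweighted sample."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))

    def kernel(A, B):
        return rbf_kernel(A, B, gamma=0.5)

    # Two hypothetical centers: feature-space means of two groups of 10 points.
    Z = X[:20]
    coef = np.zeros((2, 20))
    coef[0, :10] = 0.1
    coef[1, 10:] = 0.1

    full = kernel_kmeans_cost(X, np.ones(n), Z, coef, kernel)

    # Uniform sample of m points, each standing in for n / m original points.
    idx = rng.choice(n, size=m, replace=False)
    approx = kernel_kmeans_cost(X[idx], np.full(m, n / m), Z, coef, kernel)
    return full, approx


if __name__ == "__main__":
    full, approx = demo()
    print(f"full cost: {full:.3f}")
    print(f"sample cost: {approx:.3f}")
    print(f"relative error: {abs(full - approx) / full:.3%}")
```

The relative error printed by `demo` applies to one fixed set of centers only. A coreset in the sense described above must keep that error below ε simultaneously for all candidate centers, which is the property the paper's construction is designed to guarantee.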