Scalable Kernel $K$-Means with Randomized Sketching: from Theory to Algorithm

Rong Yin, Yong Liu, Weiping Wang, Dan Meng
DOI: https://doi.org/10.1109/tkde.2022.3222146
IF: 9.235
2023-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Kernel $k$-means is a fundamental unsupervised learning method in data mining. Its computational requirements are typically at least quadratic in the number of data points, which is prohibitive for large-scale scenarios. To address this issue, we propose SKK, a novel randomized sketching approach based on circulant matrices. SKK multiplies the kernel matrix on the left and right by the proposed sketch matrices to obtain a smaller matrix, and accelerates the matrix-matrix products with the fast Fourier transform by exploiting the circulant structure. This greatly reduces the computational requirements of the approximate kernel $k$-means estimator while retaining the same generalization bound as exact kernel $k$-means in the statistical setting. In particular, theoretical analysis shows that a sketch dimension of $\sqrt{n}$ is sufficient for SKK to achieve the optimal excess risk bound with only a fraction of the computation, where $n$ is the number of data points. Extensive experiments verify our theoretical analysis, and SKK achieves state-of-the-art performance on 12 real-world datasets. To the best of our knowledge, this is the first such breakthrough for unsupervised learning within randomized sketching.
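The two-sided sketch and the FFT speed-up described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' SKK construction: the Gaussian first column `c`, the row sub-sampling, the `1/sqrt(m)` scaling, and the helper names `circulant_matmul`, `dense_circulant`, and `sketch_kernel` are all assumptions made for illustration. It only shows that a product with a circulant matrix can be computed by FFT, and that a kernel matrix can be compressed from $n \times n$ to $m \times m$ with $m = \sqrt{n}$.

```python
import numpy as np

def circulant_matmul(c, X):
    """Compute C @ X via FFT, where C is the circulant matrix with first column c."""
    fc = np.fft.fft(c)                # eigenvalues of C
    fX = np.fft.fft(X, axis=0)        # transform each column of X
    return np.real(np.fft.ifft(fc[:, None] * fX, axis=0))

def dense_circulant(c):
    """Explicit circulant matrix C[i, j] = c[(i - j) mod n]; used only for checking."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

def sketch_kernel(K, c, m):
    """Two-sided sketch S @ K @ S.T with S = (first m rows of C) / sqrt(m).

    The form of S is an illustrative assumption, not the paper's sketch matrix.
    """
    SK = circulant_matmul(c, K)[:m] / np.sqrt(m)        # S @ K, shape (m, n)
    SKtSt = circulant_matmul(c, SK.T)[:m] / np.sqrt(m)  # S @ K.T @ S.T, shape (m, m)
    return SKtSt.T                                      # S @ K @ S.T

# Toy data: n points in 3 dimensions with an RBF kernel (hypothetical setup).
rng = np.random.default_rng(0)
n = 64
m = int(np.sqrt(n))                   # sketch dimension sqrt(n), as in the abstract
X = rng.standard_normal((n, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

c = rng.standard_normal(n)            # assumed random first column of the circulant
K_small = sketch_kernel(K, c, m)      # m x m matrix replacing the n x n kernel
```

Each call to `circulant_matmul` on an $n \times p$ matrix costs $O(p\, n \log n)$ instead of the $O(p\, n^2)$ of a dense product, which is the kind of saving the circulant structure makes possible.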