Noisy k-means++ Revisited

Christoph Grunau, Ahmet Alper Özüdoğru, Václav Rozhoň

2023-07-26

Abstract: The $k$-means++ algorithm by Arthur and Vassilvitskii [SODA 2007] is a classical and time-tested algorithm for the $k$-means problem. While being very practical, the algorithm also has good theoretical guarantees: its solution is $O(\log k)$-approximate, in expectation. In a recent work, Bhattacharya, Eube, Roglin, and Schmidt [ESA 2020] considered the following question: does the algorithm retain its guarantees if we allow for a slight adversarial noise in the sampling probability distributions used by the algorithm? This is motivated e.g. by the fact that computations with real numbers in $k$-means++ implementations are inexact. Surprisingly, the analysis under this scenario gets substantially more difficult and the authors were able to prove only a weaker approximation guarantee of $O(\log^2 k)$. In this paper, we close the gap by providing a tight, $O(\log k)$-approximate guarantee for the $k$-means++ algorithm with noise.

Data Structures and Algorithms

What problem does this paper attempt to address?

The paper asks whether the k-means++ algorithm retains its theoretical guarantee when its sampling distributions are subject to slight adversarial noise. The k-means++ algorithm of Arthur and Vassilvitskii is a classical clustering algorithm that is practical and whose solution is $O(\log k)$-approximate in expectation. Bhattacharya et al. [ESA 2020] posed the question: if the sampling probability distribution used by k-means++ can be slightly altered by an adversary, does the algorithm keep this guarantee? The question matters in practice because computations with real numbers in k-means++ implementations are inexact. In that setting, Bhattacharya et al. were able to prove only a weaker $O(\log^2 k)$ approximation guarantee; this is an upper bound obtained from their analysis, not a proof that the noisy algorithm performs worse.

The main contribution of this paper is a proof that, even with adversarial noise, k-means++ remains $O(\log k)$-approximate. The proof analyzes a process called the "adversarial sampling process," which models the effect of noise on k-means++, and shows that the noise degrades the guarantee by at most a constant factor. The paper thereby closes the gap left by the previous work and gives a tight analysis of k-means++ with noise.
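To make the setting concrete, the following is a minimal Python sketch of k-means++ seeding with perturbed sampling weights. It is an illustration under stated assumptions, not the authors' construction: the noise model (each point's squared-distance weight is multiplied by a factor in $[1-\varepsilon, 1+\varepsilon]$) is assumed here, random noise stands in for a true adversary, the first center is drawn without noise, and the names `noisy_kmeanspp`, `sq_dist`, `cost`, and `eps` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(points, centers):
    """k-means cost: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def noisy_kmeanspp(points, k, eps=0.1, rng=None):
    """Pick k centers by D^2 sampling with perturbed weights.

    Assumption: the noise multiplies each weight by a factor in
    [1 - eps, 1 + eps], with 0 <= eps < 1. Random noise is used here
    as a stand-in for an adversary choosing the factors.
    """
    rng = rng or random.Random(0)
    # First center: uniform at random (kept noise-free for simplicity).
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Perturb the exact D^2 weights before sampling.
        weights = [d * rng.uniform(1 - eps, 1 + eps) for d in d2]
        if sum(weights) == 0:
            # Every point already coincides with a center; pick any point.
            idx = rng.randrange(len(points))
        else:
            idx = rng.choices(range(len(points)), weights=weights, k=1)[0]
        centers.append(points[idx])
        # Update nearest-center distances with the new center.
        d2 = [min(d, sq_dist(p, points[idx])) for d, p in zip(d2, points)]
    return centers
```

Setting `eps=0` recovers the usual noise-free D^2 sampling. Comparing `cost(points, noisy_kmeanspp(points, k, eps))` across several values of `eps` gives an informal feel for how the seeding quality responds to noise, though such an experiment says nothing about the worst-case adversary that the paper analyzes.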

A Nearly Tight Analysis of Greedy k-means++

Improved Outlier Robust Seeding for k-means

k-means++: few more steps yield constant approximation

Multi-Swap $k$-Means++

Provably noise-robust, regularised $k$-means clustering

A Faster $k$-means++ Algorithm

Are Easy Data Easy (for K-Means)

Robust $K$-Means-type Clustering for Noisy Data

Global $k$-means$++$: an effective relaxation of the global $k$-means clustering algorithm

K-Means: A Revisit

Semi-supervised K-means++

Computing $k$-means in mixed precision

On the Consistency of Exact and Approximate Nearest Neighbor with Noisy Data

A K-MEANS CLUSTERING ALGORITHM WITH NOISE PROCESSING

Wide gaps and Kleinberg’s clustering axioms for k-means

K*-Means: an Effective and Efficient K-Means Clustering Algorithm

Scalable Kernel $K$-Means with Randomized Sketching: from Theory to Algorithm

Clustering Stable Instances of Euclidean k-means

Do you know what q-means?

Local Search k-means++ with Foresight