Abstract: The Euclidean k-means problem is arguably the most widely-studied clustering problem in machine learning. While the k-means objective is NP-hard in the worst-case, practitioners have enjoyed remarkable success in applying heuristics like Lloyd's algorithm for this problem. To address this disconnect, we study the following question: what properties of real-world instances will enable us to design efficient algorithms and prove guarantees for finding the optimal clustering? We consider a natural notion called additive perturbation stability that we believe captures many practical instances. Stable instances have unique optimal k-means solutions that do not change even when each point is perturbed a little (in Euclidean distance). This captures the property that the k-means optimal solution should be tolerant to measurement errors and uncertainty in the points. We design efficient algorithms that provably recover the optimal clustering for instances that are additive perturbation stable. When the instance has some additional separation, we show an efficient algorithm with provable guarantees that is also robust to outliers. We complement these results by studying the amount of stability in real datasets and demonstrating that our algorithm performs well on these benchmark datasets.
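
For reference, the k-means objective mentioned in the abstract can be written as below. The symbols \(X\), \(C_i\), \(\mu_i\) and the perturbation radius \(\delta\) are notation assumed here for illustration; the abstract itself does not fix any notation or say how the perturbation size is scaled.

\[
\min_{C_1,\dots,C_k} \; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2,
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x ,
\]

where \(C_1,\dots,C_k\) ranges over partitions of the point set \(X = \{x_1,\dots,x_n\}\). Informally, and under the same assumed notation, \(X\) is stable at radius \(\delta\) if every perturbed instance \(X' = \{x'_1,\dots,x'_n\}\) with \(\lVert x'_j - x_j \rVert_2 \le \delta\) for all \(j\) has the same unique optimal partition as \(X\).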

What problem does this paper attempt to address?

The paper asks how to design efficient algorithms that find the optimal clustering, with provable guarantees, for the kinds of Euclidean clustering instances that arise in practice. It focuses on the Euclidean \(k\)-means problem, one of the most widely studied clustering problems in machine learning. Although optimizing the \(k\)-means objective is NP-hard in the worst case, practitioners have had remarkable success with heuristics such as Lloyd's algorithm. To bridge this gap between theory and practice, the paper studies the following question: which properties of real-world instances make it possible to design efficient algorithms and prove that they find the optimal clustering? Its answer is a natural notion called additive perturbation stability, which the authors believe captures many practical instances. A stable instance has a unique optimal \(k\)-means solution, and that solution does not change even when each point is moved a small Euclidean distance. This reflects the expectation that the optimal \(k\)-means solution should tolerate measurement errors and uncertainty in the points. Building on this notion, the paper designs efficient algorithms that provably recover the optimal clustering of additive-perturbation-stable instances. When an instance has some additional separation, the paper also gives an efficient algorithm with provable guarantees that is robust to outliers. Finally, it complements these results by measuring the amount of stability in real datasets and showing that the algorithm performs well on these benchmark datasets.
Overall, by introducing a new stability concept and designing corresponding algorithms, the paper provides new perspectives and tools for understanding and solving practical \(k\)-means clustering problems.
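
To make the stability idea concrete, here is a minimal sketch of an empirical probe, assuming NumPy. It is not the paper's algorithm: it runs plain Lloyd iterations, moves every point by at most a chosen radius in a random direction, and checks whether the clustering stays the same. The function names `lloyd` and `survives_perturbation`, the parameter `radius`, and the toy data are all hypothetical. Because Lloyd's algorithm only finds a local optimum and only a few random perturbations are sampled, passing this check suggests stability but does not prove it.

```python
import numpy as np

def lloyd(points, centers, iters=100):
    """Plain Lloyd iterations from the given initial centers.

    Returns (labels, centers). Labels index into the returned centers.
    """
    centers = centers.copy()
    for _ in range(iters):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def survives_perturbation(points, labels, centers, radius, trials=20, seed=0):
    """Heuristic check: does the clustering survive random moves of size <= radius?"""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        # Random unit direction per point, scaled to a random length in [0, radius).
        noise = rng.normal(size=points.shape)
        noise /= np.linalg.norm(noise, axis=1, keepdims=True)
        noise *= radius * rng.uniform(size=(len(points), 1))
        # Restart from the unperturbed centers so label indices stay comparable.
        new_labels, _ = lloyd(points + noise, centers)
        if not np.array_equal(new_labels, labels):
            return False
    return True

# Toy data (hypothetical): two well-separated groups of three points each.
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = lloyd(points, points[[0, 3]])
stable = survives_perturbation(points, labels, centers, radius=0.5)
```

On this toy data the two groups are far apart relative to the radius, so `stable` comes out `True`. Restarting Lloyd from the unperturbed centers is a design choice that avoids having to match cluster labels up to permutation.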

Clustering Stable Instances of Euclidean k-means

Exact Algorithms and Lower Bounds for Stable Instances of Euclidean k-Means

Scalable K-Means for Large-Scale Clustering

An Exact Algorithm for Stable Instances of the k-Means Problem with Penalties in Fixed-Dimensional Euclidean Space

Wide gaps and Kleinberg’s clustering axioms for k-means

Computing $k$-means in mixed precision

A Scalable Algorithm for Individually Fair K-means Clustering

Wide Gaps and Clustering Axioms

When do birds of a feather flock together?

EPTAS for $k$-means Clustering of Affine Subspaces

Clustering What Matters in Constrained Settings

Improved Algorithms for Clustering with Outliers

Optimal Time Bounds for Approximate Clustering

Relax and Merge: A Simple Yet Effective Framework for Solving Fair $k$-Means and $k$-sparse Wasserstein Barycenter Problems

t-k-means: A k-means Variant with Robustness and Stability

Clustering with Distributed Data

Hybrid k-Clustering: Blending k-Median and k-Center

Beyond K-Means++: Towards Better Cluster Exploration with Geometrical Information

When Do Birds of a Feather Flock Together? K-Means, Proximity, and Conic Programming

Relax, no need to round: integrality of clustering formulations

Clustering Stability-Based Evolutionary K-Means