Clustering Stable Instances of Euclidean k-means

Abhratanu Dutta, Aravindan Vijayaraghavan, Alex Wang
DOI: https://doi.org/10.48550/arXiv.1712.01241
2017-12-05
Abstract: The Euclidean k-means problem is arguably the most widely-studied clustering problem in machine learning. While the k-means objective is NP-hard in the worst-case, practitioners have enjoyed remarkable success in applying heuristics like Lloyd's algorithm for this problem. To address this disconnect, we study the following question: what properties of real-world instances will enable us to design efficient algorithms and prove guarantees for finding the optimal clustering? We consider a natural notion called additive perturbation stability that we believe captures many practical instances. Stable instances have unique optimal k-means solutions that do not change even when each point is perturbed a little (in Euclidean distance). This captures the property that the k-means optimal solution should be tolerant to measurement errors and uncertainty in the points. We design efficient algorithms that provably recover the optimal clustering for instances that are additive perturbation stable. When the instance has some additional separation, we show an efficient algorithm with provable guarantees that is also robust to outliers. We complement these results by studying the amount of stability in real datasets and demonstrating that our algorithm performs well on these benchmark datasets.
Machine Learning, Data Structures and Algorithms
What problem does this paper attempt to address?
This paper asks how to design efficient algorithms for Euclidean \(k\)-means clustering that provably find the optimal clustering on the kinds of instances that arise in practice. Euclidean \(k\)-means is one of the most widely studied clustering problems in machine learning. Although optimizing the \(k\)-means objective is NP-hard in the worst case, practitioners have had remarkable success with heuristics such as Lloyd's algorithm. To bridge this gap between theory and practice, the paper studies the following question: which properties of real-world instances allow us to design efficient algorithms and prove guarantees for finding the optimal clustering? Its answer is built on a natural notion called additive perturbation stability, which the authors believe captures many practical instances. A stable instance has a unique optimal \(k\)-means solution, and that solution does not change even when each point is perturbed slightly in Euclidean distance. This reflects the expectation that the optimal \(k\)-means solution should tolerate measurement errors and uncertainty in the points. For instances that are additive perturbation stable, the paper designs efficient algorithms that provably recover the optimal clustering. When the instance has some additional separation, it also gives an efficient algorithm with provable guarantees that is robust to outliers. The paper complements these results by measuring the amount of stability in real datasets and showing that its algorithm performs well on these benchmark datasets.
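To make the ideas in the summary above concrete, the following is a minimal Python sketch of Lloyd's heuristic together with a rough empirical probe of how a clustering reacts to small perturbations of the points. It is not the algorithm from the paper. The function names (`lloyd`, `perturb`, `looks_stable`), the fixed initial centers, the random perturbation model, and the number of trials are all assumptions made for illustration. Note also that additive perturbation stability is a statement about the optimal clustering under every perturbation of bounded size, whereas this probe only compares the output of a heuristic on a handful of random perturbations, so a `True` result is suggestive at best.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd(points, init_centers, max_iters=100):
    """Plain Lloyd's heuristic started from the given initial centers."""
    centers = [tuple(c) for c in init_centers]
    labels = assign(points, centers)
    for _ in range(max_iters):
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of the cluster.
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                # Keep the old center if a cluster becomes empty.
                new_centers.append(centers[j])
        centers = new_centers
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


def kmeans_cost(points, labels, centers):
    """k-means objective: sum of squared distances to assigned centers."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def perturb(points, eps, rng):
    """Move each point by a random vector of Euclidean length at most eps."""
    moved = []
    for p in points:
        direction = [rng.gauss(0.0, 1.0) for _ in p]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        radius = eps * rng.random()
        moved.append(tuple(x + radius * d / norm
                           for x, d in zip(p, direction)))
    return moved


def same_partition(a, b):
    """True if two label lists describe the same partition up to renaming."""
    return len(set(zip(a, b))) == len(set(a)) == len(set(b))


def looks_stable(points, init_centers, eps, trials=20, seed=0):
    """Heuristic probe: does Lloyd's output survive random eps-perturbations?"""
    rng = random.Random(seed)
    base_labels, _ = lloyd(points, init_centers)
    for _ in range(trials):
        labels, _ = lloyd(perturb(points, eps, rng), init_centers)
        if not same_partition(base_labels, labels):
            return False
    return True


# Two well-separated groups of four points each (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
init_centers = [points[0], points[4]]
labels, centers = lloyd(points, init_centers)
print(labels, kmeans_cost(points, labels, centers))
print(looks_stable(points, init_centers, eps=0.5))
```

On well-separated data like the toy example, moving each point by a distance much smaller than the gap between clusters cannot change which center is nearest, which is the intuition behind stability. Certifying stability for a real dataset would require reasoning about the optimal solution rather than the output of a heuristic.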
Overall, by studying additive perturbation stability and designing algorithms for stable instances, the paper offers a new perspective and new tools for understanding and solving practical \(k\)-means clustering problems.