Abstract: Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique, which we call successive sampling, that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say 1/100) probability. Thus we establish a tight time bound of Θ(nk) for the k-median problem for a wide range of values of k. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
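To make the objective concrete, the following is a minimal sketch of how the k-median cost (average distance to the nearest center) and the k-means cost (average squared distance) of a candidate solution could be evaluated. It is illustrative only and is not the paper's algorithm: the function names `kmedian_cost` and `kmeans_cost`, the use of Euclidean distance, and the sample data are all assumptions. The naive evaluation performs n·k distance computations, which gives some intuition for why Θ(nk) is a natural time scale for this problem.

```python
import math

def kmedian_cost(points, centers):
    """Average distance from each point to its nearest center (assumes Euclidean metric)."""
    total = sum(min(math.dist(p, c) for c in centers) for p in points)
    return total / len(points)

def kmeans_cost(points, centers):
    """Average squared distance from each point to its nearest center."""
    total = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
    return total / len(points)

# Hypothetical toy input: two well-separated groups, one center placed in each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

print(kmedian_cost(points, centers))
print(kmeans_cost(points, centers))
```

Each call loops over all n points and all k centers, so it uses exactly n·k distance evaluations for this input representation.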

An Approximate Algorithm for K-Means Problem Based on Input Points

A Novel K'-Means Algorithm For Clustering Analysis

Approximation Algorithms for K-Modes Clustering

Fast Approximate K-Means Via Cluster Closures

Optimal Time Bounds for Approximate Clustering

An improved K-means algorithm based on multiple feature points

r-Reference points based k-means algorithm

A local search algorithm for k-means with outliers

An Efficient Clustering Algorithm Based on Local Optimality of K-Means

A Novel Effective Distance Measure and a Relevant Algorithm for Optimizing the Initial Cluster Centroids of K-means

An Efficient K-Means Clustering Algorithm On Mapreduce

Multi-Prototypes Convex Merging Based K-Means Clustering Algorithm

Scalable Kernel Clustering: Approximate Kernel k-means

Research on K-Value Selection Method of K-Means Clustering Algorithm

Metaheuristic Strategy Based K-Means with the Iterative Self-Learning Framework

Subspace Clustering by Directly Solving Discriminative K-means

Stable Initialization Scheme for K-means Clustering

A Simple and Fast Algorithm for Global K-means Clustering

A Scalable Algorithm for Individually Fair K-means Clustering

K and starting means for k-means algorithm

Constant Approximation for K-Median and K-Means with Outliers Via Iterative Rounding