Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal

Lingxiao Huang,Nisheeth K. Vishnoi
DOI: https://doi.org/10.1145/3357713.3384296
2020-01-01
Abstract:Given a collection of n points in ℝ d , the goal of the ( k , z )-clustering problem is to find a subset of k “centers” that minimizes the sum of the z -th powers of the Euclidean distance of each point to the closest center. Special cases of the ( k , z )-clustering problem include the k -median and k -means problems. Our main result is a unified two-stage importance sampling framework that constructs an ε-coreset for the ( k , z )-clustering problem. Compared to the results for ( k , z )-clustering in [Feldman and Langberg, STOC 2011], our framework saves a ε 2 d factor in the coreset size. Compared to the results for ( k , z )-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a poly( k ) factor in the coreset size and avoids the exp( k /ε) term in the construction time. Specifically, our coreset for k -median ( z =1) has size Õ(ε −4 k ) which, when compared to the result in [Sohler and Woodruff, STOC 2018], saves a k factor in the coreset size. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape fitting problems: subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of Ω( k · min{2 z /20 , d }) for a 0.01-coreset for ( k , z )-clustering, which has a linear dependence of size on k and an exponential dependence on z that matches our algorithmic results.
What problem does this paper attempt to address?