Clustering Mixtures with Almost Optimal Separation in Polynomial Time
Jerry Li, Allen Liu
DOI: https://doi.org/10.1137/22m1538788
2024-02-23
SIAM Journal on Computing
Abstract: We consider the problem of clustering mixtures of mean-separated Gaussians in high dimensions. We are given samples from a mixture of [math] identity covariance Gaussians, so that the minimum pairwise distance between any two means is at least [math], for some parameter [math], and the goal is to recover the ground truth clustering of these samples. It is folklore that separation [math] is both necessary and sufficient to recover a good clustering (say, with constant or [math] error), at least information-theoretically. However, the estimators which achieve this guarantee are inefficient. We give the first algorithm which runs in polynomial time in both [math] and the dimension [math], and which almost matches this guarantee. More precisely, we give an algorithm which takes polynomially many samples and time, and which can successfully recover a good clustering, so long as the separation is [math], for any [math]. Previously, polynomial time algorithms were only known for this problem when the separation was polynomial in [math], and all algorithms which could tolerate [math] separation required quasipolynomial time. We also extend our result to mixtures of translations of a distribution which satisfies the Poincaré inequality, under additional mild assumptions. Our main technical tool, which we believe is of independent interest, is a novel way to implicitly represent and estimate high degree moments of a distribution, which allows us to extract important information about high degree moments without ever writing down the full moment tensors explicitly.
computer science, theory & methods; mathematics, applied