Abstract: We investigate the existence of a fundamental computation-information gap for the problem of clustering a mixture of isotropic Gaussians in the high-dimensional regime, where the ambient dimension $p$ is larger than the number $n$ of points. The existence of a computation-information gap in a specific Bayesian high-dimensional asymptotic regime has been conjectured by arXiv:1610.02918 based on the replica heuristic from statistical physics. We provide evidence of the existence of such a gap generically in the high-dimensional regime $p \geq n$, by (i) proving a non-asymptotic low-degree polynomial computational barrier for clustering in high dimension, matching the performance of the best known polynomial-time algorithms, and by (ii) establishing that the information barrier for clustering is smaller than the computational barrier when the number $K$ of clusters is large enough. These results are in contrast with the (moderately) low-dimensional regime $n \geq poly(p, K)$, where there is no computation-information gap for clustering a mixture of isotropic Gaussians. In order to prove our low-degree computational barrier, we develop sophisticated combinatorial arguments to upper-bound the mixed moments of the signal under a Bernoulli Bayesian model.

What problem does this paper attempt to address?

This paper studies the computation-information gap in high-dimensional clustering, in particular when the dimension $p$ is larger than the sample size $n$. The authors provide evidence that this gap is fundamental: the separation that polynomial-time algorithms need in order to cluster (the computational barrier) is larger than the separation at which clustering becomes statistically possible (the information barrier). Specifically, in the high-dimensional setting $p \geq n$, the paper proves a non-asymptotic low-degree polynomial computational lower bound that matches the performance of the best known polynomial-time algorithms. It also shows that the information barrier is smaller than the computational barrier when the number of clusters $K$ is sufficiently large.

The main contributions of the paper are:
1. Proving that a minimum separation close to the level of the paper's equation (3) is necessary for polynomial-time algorithms in high-dimensional clustering, up to a factor of at most polylog(n).
2. Proving that the non-trivial information barrier for clustering is $\Delta^2 \gtrsim \log(K) \vee \sqrt{pK\log(K)/n}$, a level at which the K-means algorithm is information-optimal.

These results support and extend the conjecture that a computation-information gap exists in the limiting regime where $p/n$ tends to $\gamma \in [(K/2-2)^{-2}, +\infty)$. In low dimension ($n \geq poly(p, K)$) there is no such gap, but in high dimension, when $K$ exceeds a certain constant $K_0$, there is a significant computation-information gap. To prove the low-degree polynomial lower bound, the paper uses intricate combinatorial arguments to upper-bound the mixed moments of the signal. It also discusses the differences between the high-dimensional and moderately low-dimensional settings, and explains why clustering rates cannot simply be derived from estimation rates.

In addition, the paper relies on the low-degree polynomial framework, a model of computation that captures many state-of-the-art algorithms, so a lower bound in this framework is strong evidence of computational hardness.
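To give a feel for the information barrier quoted in the summary, the short sketch below evaluates the expression $\log(K) \vee \sqrt{pK\log(K)/n}$ for a few values of $(n, p, K)$. It is only an illustration: it assumes that reading of the formula, the barrier is stated up to constant factors that are ignored here, and the function name `information_barrier` and the sample values are hypothetical, not taken from the paper.

```python
import math


def information_barrier(n: int, p: int, K: int) -> float:
    """Order of magnitude of log(K) v sqrt(p*K*log(K)/n).

    Assumption: constants hidden by the paper's ">~" notation are dropped,
    so only the scaling in (n, p, K) is meaningful.
    """
    if n <= 0 or p <= 0 or K < 2:
        raise ValueError("need n > 0, p > 0 and K >= 2")
    log_k = math.log(K)
    # "a v b" denotes the maximum of the two terms.
    return max(log_k, math.sqrt(p * K * log_k / n))


# Hypothetical sample sizes and dimensions, chosen only for illustration.
for n, p, K in [(100_000, 1_000, 10), (1_000, 1_000, 10), (1_000, 100_000, 10)]:
    print(f"n={n}, p={p}, K={K}: barrier ~ {information_barrier(n, p, K):.3f}")
```

When $n$ is much larger than $pK/\log(K)$ the $\log(K)$ term dominates, while for $p \geq n$ the square-root term takes over and grows with the ratio $p/n$.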

Computation-information gap in high-dimensional clustering
