Computation-information gap in high-dimensional clustering

Bertrand Even, Christophe Giraud, Nicolas Verzelen
2024-02-28
Abstract: We investigate the existence of a fundamental computation-information gap for the problem of clustering a mixture of isotropic Gaussians in the high-dimensional regime, where the ambient dimension $p$ is larger than the number $n$ of points. The existence of a computation-information gap in a specific Bayesian high-dimensional asymptotic regime has been conjectured by arXiv:1610.02918 based on the replica heuristic from statistical physics. We provide evidence of the existence of such a gap generically in the high-dimensional regime $p \geq n$, by (i) proving a non-asymptotic low-degree polynomial computational barrier for clustering in high dimension, matching the performance of the best known polynomial-time algorithms, and by (ii) establishing that the information barrier for clustering is smaller than the computational barrier when the number $K$ of clusters is large enough. These results are in contrast with the (moderately) low-dimensional regime $n \geq \mathrm{poly}(p, K)$, where there is no computation-information gap for clustering a mixture of isotropic Gaussians. In order to prove our low-degree computational barrier, we develop sophisticated combinatorial arguments to upper-bound the mixed moments of the signal under a Bernoulli Bayesian model.
Statistics Theory
What problem does this paper attempt to address?
This paper studies the computation-information gap in high-dimensional clustering of a mixture of isotropic Gaussians, in the regime where the dimension $p$ is at least the sample size $n$. It gives evidence that such a gap is a generic feature of this regime: the cluster separation required by polynomial-time algorithms (the computational barrier) is larger than the separation required when computation time is unrestricted (the information barrier). Specifically, for $p \geq n$ the paper proves a non-asymptotic low-degree polynomial computational lower bound that matches the performance of the best known polynomial-time algorithms, and it shows that the information barrier is smaller than the computational barrier when the number of clusters $K$ is sufficiently large. The main contributions of the paper include:
1. Proving that, in high-dimensional clustering, a minimum separation close to the one in equation (3) of the paper is necessary for low-degree polynomials, the paper's proxy for polynomial-time algorithms, up to a possible $\mathrm{polylog}(n)$ factor.
2. Proving that the information barrier for non-trivial clustering is $\Delta^2 \gtrsim \log(K) \vee \sqrt{pK\log(K)/n}$, where $\Delta$ is the separation between clusters and $\vee$ denotes the maximum, and that K-means is information-optimal at this separation.
These results support and extend the conjecture of a computation-information gap in the asymptotic regime where $p/n$ tends to $\gamma \in [(K/2-2)^{-2}, +\infty)$. In low dimension ($n \geq \mathrm{poly}(p, K)$) there is no such gap, but in high dimension there is a computation-information gap as soon as $K$ exceeds a certain constant $K_0$. To prove the low-degree polynomial computational lower bound, the paper uses involved combinatorial arguments to upper-bound the mixed moments of the signal under a Bernoulli Bayesian model. It also discusses the differences between the high-dimensional and moderately low-dimensional settings, and the reasons why clustering rates cannot simply be derived from estimation rates.
In addition, the paper relies on the low-degree polynomial framework, a model of computation that captures many state-of-the-art algorithms, so a lower bound in this framework is regarded as strong evidence that a problem is computationally hard.
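As a rough numerical illustration of the information barrier quoted above, $\Delta^2 \gtrsim \log(K) \vee \sqrt{pK\log(K)/n}$, the Python sketch below evaluates its order of magnitude and reports which of the two terms attains the maximum. The function names `information_barrier` and `dominant_term` are hypothetical, and the constant hidden in $\gtrsim$ is set to 1 purely as an assumption, so the values indicate scaling only and are not actual thresholds from the paper.

```python
import math


def information_barrier(n: int, p: int, K: int) -> float:
    """Order of the squared separation Delta^2 at the information barrier.

    Assumption: the unspecified constant hidden in the ">~" sign is taken to be 1.
    """
    if n < 1 or p < 1 or K < 2:
        raise ValueError("need n >= 1, p >= 1 and K >= 2")
    log_term = math.log(K)
    high_dim_term = math.sqrt(p * K * math.log(K) / n)
    # The symbol "v" in the formula denotes the maximum of the two terms.
    return max(log_term, high_dim_term)


def dominant_term(n: int, p: int, K: int) -> str:
    """Name the term that attains the maximum in the barrier."""
    log_term = math.log(K)
    high_dim_term = math.sqrt(p * K * math.log(K) / n)
    return "sqrt(p*K*log(K)/n)" if high_dim_term > log_term else "log(K)"


if __name__ == "__main__":
    # One low-dimensional and one high-dimensional configuration (illustrative values).
    for n, p, K in [(1000, 10, 4), (100, 10_000, 4)]:
        value = information_barrier(n, p, K)
        print(f"n={n}, p={p}, K={K}: {value:.3f} ({dominant_term(n, p, K)})")
```

Under this constant-1 assumption, the square-root term exceeds $\log(K)$ exactly when $pK > n\log(K)$. Since $K > \log(K)$, this always holds in the regime $p \geq n$, so the dimension-dependent term drives the barrier there.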