Ilias Diakonikolas, Samuel B. Hopkins, Ankit Pensia, Stefan Tiegel
Abstract: We prove that there is a universal constant $C>0$ so that for every $d \in \mathbb N$, every centered subgaussian distribution $\mathcal D$ on $\mathbb R^d$, and every even $p \in \mathbb N$, the $d$-variate polynomial $(Cp)^{p/2} \cdot \|v\|_{2}^p - \mathbb E_{X \sim \mathcal D} \langle v,X\rangle^p$ is a sum of square polynomials. This establishes that every subgaussian distribution is *SoS-certifiably subgaussian*, a condition that yields efficient learning algorithms for a wide variety of high-dimensional statistical tasks. As a direct corollary, we obtain computationally efficient algorithms with near-optimal guarantees for the following tasks, when given samples from an arbitrary subgaussian distribution: robust mean estimation, list-decodable mean estimation, clustering mean-separated mixture models, robust covariance-aware mean estimation, robust covariance estimation, and robust linear regression. Our proof makes essential use of Talagrand's generic chaining/majorizing measures theorem.
### What problem does this paper attempt to solve?
This paper asks which subgaussian distributions are SoS-certifiably subgaussian, shows that all of them are, and derives a series of efficient algorithms from that fact. Specifically, the authors answer the following core question:
**Question 1.5: Can we characterize all certifiably subgaussian distributions?**
#### Background and Motivation
An important goal in robust statistics is to design computationally efficient estimators that achieve near-optimal accuracy even when a constant fraction of the data is contaminated. A typical problem is robust mean estimation: given a set of data points \(S\) and a contamination parameter \(\epsilon\), a \((1 - \epsilon)\) fraction of the points are drawn from an unknown distribution \(P\), and the remaining \(\epsilon\) fraction may be arbitrary or adversarially chosen. The goal is to estimate the mean \(\mu\) of \(P\).
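The contamination setup above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not the paper's model or algorithm: the inliers are standard Gaussian, the adversary is a crude one that plants every outlier at the same far-away point, and the names `contaminated_sample`, `coordinatewise`, and `outlier_value` are hypothetical. The coordinate-wise median is used only as a familiar baseline; its error grows with the dimension, which is part of what the algorithms discussed here avoid.

```python
import random
import statistics

def contaminated_sample(n, eps, dim, outlier_value=50.0, seed=0):
    """Draw n points in R^dim; an eps fraction is replaced by planted outliers."""
    rng = random.Random(seed)
    n_bad = int(eps * n)
    # Inliers: standard Gaussian, so the true mean is the zero vector.
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n - n_bad)]
    # Outliers: all placed at the same far-away point (a crude adversary).
    points += [[outlier_value] * dim for _ in range(n_bad)]
    return points

def l2_norm(vec):
    return sum(x * x for x in vec) ** 0.5

def coordinatewise(stat, points):
    """Apply a one-dimensional statistic to each coordinate separately."""
    return [stat(col) for col in zip(*points)]

sample = contaminated_sample(n=2000, eps=0.1, dim=5)
# The true mean is zero, so the norm of each estimate is its l2 error.
mean_err = l2_norm(coordinatewise(statistics.fmean, sample))
median_err = l2_norm(coordinatewise(statistics.median, sample))
print(f"l2 error of sample mean:   {mean_err:.3f}")
print(f"l2 error of coord. median: {median_err:.3f}")
```

The sample mean is dragged toward the planted point by roughly \(\epsilon\) times its distance, while the median moves far less.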
For Gaussian distributions, previous work gives polynomial-time algorithms that achieve \(\ell_2\) error \(\tilde{O}(\epsilon)\). However, the Gaussian assumption is often too restrictive to model practical scenarios, so researchers turn to more general families of distributions, such as subgaussian distributions.
Subgaussian distributions are a widely studied non-parametric family whose linear projections have tail probabilities that decay at least as fast as those of a Gaussian. Information-theoretically, the mean of any subgaussian distribution can be robustly estimated to error \(\tilde{O}(\epsilon)\). However, for general subgaussian distributions the best previously known error guarantee for efficient algorithms was \(O(\epsilon^{1/2})\), well short of the Gaussian rate.
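To get a feel for the gap between the two rates, the following sketch evaluates both with all constants dropped. It assumes that the \(\tilde{O}(\epsilon)\) rate takes the form \(\epsilon\sqrt{\log(1/\epsilon)}\), which is a common form for Gaussian data but is not stated in the text above; the function names are hypothetical.

```python
import math

def gaussian_like_rate(eps):
    # Assumed form of the O~(eps) rate: eps * sqrt(log(1/eps)), constants dropped.
    return eps * math.sqrt(math.log(1.0 / eps))

def sqrt_rate(eps):
    # The O(eps^{1/2}) rate, constants dropped.
    return math.sqrt(eps)

for eps in (0.1, 0.01, 0.001):
    print(
        f"eps={eps:<6} sqrt(eps)={sqrt_rate(eps):.4f}  "
        f"eps*sqrt(log(1/eps))={gaussian_like_rate(eps):.4f}"
    )
```

At \(\epsilon = 0.01\), for instance, \(\sqrt{\epsilon} = 0.1\) while \(\epsilon\sqrt{\log(1/\epsilon)} \approx 0.021\), and the ratio widens as \(\epsilon\) shrinks.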
#### Main Contributions
The authors close this gap by proving that every subgaussian distribution is certifiably subgaussian. Specifically, they prove the following theorem:
**Theorem 1.6 (Certifiability of Subgaussian Distributions)**: There exists a universal constant \(C>0\) such that for every \(s\)-subgaussian random vector \(X\sim P\) in \(\mathbb{R}^d\) and every even \(m\), \(P\) is \((Cs\sqrt{m}, m)\)-certifiably bounded. In particular, \(P\) is \(Cs\)-certifiably subgaussian.
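A concrete instance may help make the statement tangible. In the abstract's notation with \(p = 4\), the claim is that \((4C)^{2}\|v\|_2^4 - \mathbb{E}\langle v, X\rangle^4\) is a sum of squares. For \(X\) uniform on \(\{-1,+1\}^d\) (an illustrative choice, not an example taken from the paper), expanding the fourth power gives \(\mathbb{E}\langle v, X\rangle^4 = 3\|v\|_2^4 - 2\sum_i v_i^4\), so \(3\|v\|_2^4 - \mathbb{E}\langle v, X\rangle^4 = \sum_i (\sqrt{2}\,v_i^2)^2\) is an explicit sum of squares, and any \(C\) with \((4C)^2 \ge 3\) suffices in this special case. The sketch below checks the identity by exact enumeration; the function names are hypothetical.

```python
import itertools
import random

def fourth_moment_rademacher(v):
    """Exact E<v, X>^4 for X uniform on {-1, +1}^d, by enumerating all 2^d sign patterns."""
    d = len(v)
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=d):
        total += sum(s * vi for s, vi in zip(signs, v)) ** 4
    return total / 2 ** d

def sos_slack(v):
    """The explicit sum of squares 2 * sum_i (v_i^2)^2."""
    return 2.0 * sum(vi ** 4 for vi in v)

rng = random.Random(1)
for _ in range(5):
    v = [rng.uniform(-1.0, 1.0) for _ in range(6)]
    norm4 = sum(vi * vi for vi in v) ** 2  # ||v||_2^4
    gap = 3.0 * norm4 - fourth_moment_rademacher(v)
    # The gap should equal the explicit sum of squares up to rounding error.
    assert abs(gap - sos_slack(v)) < 1e-9
print("3*||v||^4 - E<v,X>^4 matched 2*sum(v_i^4) on all test vectors")
```

The theorem's content is that such a certificate exists for every subgaussian distribution and every even degree, even where no closed-form identity like this one is available.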
Because the certificate exists for every subgaussian distribution, the result applies to a range of high-dimensional statistical tasks, including but not limited to:
- **Robust Mean Estimation**: for subgaussian distributions, with error measured in the \(\ell_2\) norm.
- **List-Decodable Mean Estimation**: for subgaussian distributions, with error measured in the \(\ell_2\) norm.
- **Clustering Mixture Models**: for subgaussian components under a mean-separation assumption.
- **Robust Covariance Estimation**: for hypercontractive subgaussian distributions, with error measured in the relative spectral norm.
- **Robust Linear Regression**: for hypercontractive subgaussian distributions under arbitrary noise.
In each case, the consequence is a computationally efficient algorithm with near-optimal guarantees that works given samples from an arbitrary subgaussian distribution.
#### Technical Overview
To prove Theorem 1.6, the authors combine duality with Talagrand's generic chaining (majorizing measures) theorem. Duality turns the existence of the sum-of-squares certificate into the problem of bounding an empirical process in expectation. Chaining then lets them pass from this nonlinear empirical process to a linear one, which they control using concentration inequalities for Gaussian processes.
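The duality step can be sketched in general terms as follows. This is the standard conic duality for sum-of-squares polynomials, written in the abstract's notation with \(B^p\) standing in for \((Cp)^{p/2}\); the symbols \(q_B\), \(L\), and \(h\) are introduced here for illustration, and the paper's exact formulation may differ.

\[
q_B(v) \;=\; B^{p}\,\|v\|_2^{p} \;-\; \mathbb{E}_{X \sim \mathcal D}\,\langle v, X\rangle^{p}
\]

\[
q_B \text{ is a sum of squares}
\quad\Longleftrightarrow\quad
L(q_B) \ge 0 \ \text{ for every linear functional } L \text{ on polynomials of degree} \le p \text{ with } L(h^2) \ge 0 \text{ for all } \deg h \le p/2 .
\]

\[
\text{Equivalently:}\qquad
\mathbb{E}_{X \sim \mathcal D}\, L\!\left(\langle v, X\rangle^{p}\right) \;\le\; B^{p}\, L\!\left(\|v\|_2^{p}\right)
\quad \text{for every such } L .
\]

Read this way, certifying the bound amounts to controlling the left-hand side uniformly over all admissible functionals \(L\), which is a supremum of a process indexed by \(L\) rather than by unit vectors \(v\).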
In summary, by proving that every subgaussian distribution is certifiably subgaussian, the paper gives a single theoretical basis for efficient algorithms across this family of high-dimensional statistical tasks.