Statistical Inference in Classification of High-dimensional Gaussian Mixture

Hanwen Huang, Peng Zeng
2024-10-26
Abstract: We consider the classification problem of a high-dimensional mixture of two Gaussians with general covariance matrices. Using the replica method from statistical physics, we investigate the asymptotic behavior of a general class of regularized convex classifiers in the high-dimensional limit, where both the sample size $n$ and the dimension $p$ approach infinity while their ratio $\alpha=n/p$ remains fixed. Our focus is on the generalization error and variable selection properties of the estimators. Specifically, based on the distributional limit of the classifier, we construct a de-biased estimator to perform variable selection through an appropriate hypothesis testing procedure. Using $L_1$-regularized logistic regression as an example, we conducted extensive computational experiments to confirm that our analytical findings are consistent with numerical simulations in finite-sized systems. We also explore the influence of the covariance structure on the performance of the de-biased estimator.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the classification of data drawn from a high-dimensional mixture of two Gaussians with general covariance matrices. Specifically, the authors study the asymptotic behavior of a class of regularized convex classifiers when the sample size \(n\) and the feature dimension \(p\) both tend to infinity while their ratio \(\alpha=\frac{n}{p}\) remains fixed. The focus is on the generalization error and variable selection performance of these estimators.

### Main problems

1. **High-dimensional classification problem**:
   - The paper studies how regularized convex classifiers (such as L1-regularized logistic regression) perform on a mixture of two Gaussians when both \(n\) and \(p\) are large.
   - The specific objective is to characterize the asymptotic behavior of these classifiers in the high-dimensional limit, especially their generalization error and variable selection performance.
2. **Variable selection and statistical inference**:
   - The authors construct a de-biased estimator, based on the distributional limit of the classifier, for variable selection in high-dimensional settings.
   - Through an appropriate hypothesis testing procedure, confidence intervals and p-values for the parameters can be calculated, which enables statistical inference.

### Methods

- **Replica method**:
  - The replica method from statistical physics is used to study the asymptotic behavior of the classifiers in the high-dimensional limit.
  - The replica method is a powerful, though not fully rigorous, tool that predicts the joint distribution of the estimators and the true parameters.
- **Numerical experiments**:
  - The theoretical results are checked against extensive numerical experiments in finite-sized systems.
  - The experiments use data-generating processes with different correlation structures to evaluate classification accuracy and variable selection performance.

### Main contributions

1. **Theoretical analysis**:
   - Derive the asymptotic distribution of regularized convex classifiers in the high-dimensional limit.
   - Construct a de-biased estimator and demonstrate its effectiveness for variable selection.
2. **Numerical verification**:
   - Show through numerical experiments that the theoretical predictions are consistent with simulation results.
   - Study the performance of the classifiers under different correlation structures and sparsity levels.

### Conclusions

- The paper provides a theoretical basis and practical tools for classification problems in high-dimensional data.
- The proposed method applies not only to L1-regularized logistic regression but can also be extended to other regularized convex classifiers, such as support vector machines.
- Future research directions include providing more rigorous theoretical proofs and applying the method to more types of classification problems.
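The kind of finite-size experiment described above can be illustrated with a minimal sketch. It is not the authors' code, and several choices are assumptions made only for illustration: the two classes have means \(\pm\mu\) and equal priors, they share one AR(1) covariance with parameter `rho` (the paper allows general covariance matrices), the mean vector is sparse, the classifier has no intercept, and the L1-regularized logistic regression is fitted with a simple proximal gradient (ISTA) loop. The function names `simulate_mixture`, `soft_threshold`, and `fit_l1_logistic` are hypothetical.

```python
import numpy as np


def simulate_mixture(n, p, mu, rho, rng):
    """Draw n points with labels y = +/-1 and x | y ~ N(y * mu, Sigma)."""
    # Assumed covariance: AR(1), Sigma[i, j] = rho ** |i - j|.
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    chol = np.linalg.cholesky(sigma)
    y = rng.choice([-1.0, 1.0], size=n)  # equal class priors (assumption)
    x = y[:, None] * mu[None, :] + rng.standard_normal((n, p)) @ chol.T
    return x, y


def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_l1_logistic(x, y, lam, n_iter=500):
    """Minimize mean logistic loss + lam * ||w||_1 by proximal gradient."""
    n, p = x.shape
    # The logistic-loss gradient is Lipschitz with constant <= ||X||_2^2 / (4n),
    # so 1 / L is a safe fixed step size.
    step = 4.0 * n / np.linalg.norm(x, 2) ** 2
    w = np.zeros(p)
    for _ in range(n_iter):
        margins = y * (x @ w)
        # sigmoid(-margins), written with tanh for numerical stability
        s = 0.5 * (1.0 - np.tanh(0.5 * margins))
        grad = -(x.T @ (y * s)) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w


rng = np.random.default_rng(0)
p, alpha = 200, 2.0            # dimension and ratio alpha = n / p
n = int(alpha * p)
mu = np.zeros(p)
mu[:10] = 0.5                  # assumed sparse signal: 10 informative features

x_train, y_train = simulate_mixture(n, p, mu, rho=0.3, rng=rng)
x_test, y_test = simulate_mixture(5000, p, mu, rho=0.3, rng=rng)

w_hat = fit_l1_logistic(x_train, y_train, lam=0.05)
test_error = float(np.mean(np.sign(x_test @ w_hat) != y_test))
n_selected = int(np.count_nonzero(w_hat))
print(f"alpha={alpha}, test error={test_error:.3f}, selected={n_selected}/{p}")
```

Varying `alpha`, `rho`, `lam`, or the number of nonzero entries in `mu` gives a rough empirical view of how the generalization error and the selected support change with sample ratio, correlation, and sparsity. The nonzero entries of `w_hat` come from the raw L1 estimate; the de-biasing and hypothesis testing step of the paper is not reproduced here.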