Abstract: We consider the classification problem of a high-dimensional mixture of two Gaussians with general covariance matrices. Using the replica method from statistical physics, we investigate the asymptotic behavior of a general class of regularized convex classifiers in the high-dimensional limit, where both the sample size $n$ and the dimension $p$ approach infinity while their ratio $\alpha=n/p$ remains fixed. Our focus is on the generalization error and variable selection properties of the estimators. Specifically, based on the distributional limit of the classifier, we construct a de-biased estimator to perform variable selection through an appropriate hypothesis testing procedure. Using $L_1$-regularized logistic regression as an example, we conducted extensive computational experiments to confirm that our analytical findings are consistent with numerical simulations in finite-sized systems. We also explore the influence of the covariance structure on the performance of the de-biased estimator.

What problem does this paper attempt to address?

This paper addresses the classification of a high-dimensional mixture of two Gaussians. Specifically, the authors study the asymptotic behavior of a class of regularized convex classifiers when the number of samples $n$ and the feature dimension $p$ both tend to infinity while their ratio $\alpha=\frac{n}{p}$ remains fixed. The focus is on the generalization error and variable selection performance of these estimators.

### Main problems

1. **High-dimensional classification problem**:
   - The paper studies how regularized convex classifiers (such as L1-regularized logistic regression) perform on a mixture of two Gaussians when both $n$ and $p$ are large.
   - The specific objective is to analyze the asymptotic behavior of these classifiers in the high-dimensional limit, especially their generalization error and variable selection performance.
2. **Variable selection and statistical inference**:
   - The authors propose a de-biased estimator for variable selection in high-dimensional settings.
   - Through an appropriate hypothesis-testing procedure, confidence intervals and p-values for the parameters can be calculated, which enables statistical inference.

### Methods

- **Replica method**:
  - The replica method from statistical physics is used to study the asymptotic behavior of the classifiers in the high-dimensional limit.
  - The replica method is a non-rigorous but powerful tool that yields predictions for the joint distribution of the estimators and the true parameters.
- **Numerical experiments**:
  - The theoretical results are checked against extensive numerical experiments in finite-sized systems.
  - The experiments include data-generation processes with different correlation structures, used to evaluate classification accuracy and variable selection performance.

### Main contributions

1. **Theoretical analysis**:
   - Derive the asymptotic distribution of regularized convex classifiers in the high-dimensional limit.
   - Propose a de-biased estimator and demonstrate its effectiveness in variable selection.
2. **Numerical verification**:
   - Show through numerical experiments that the theoretical predictions are consistent with simulation results.
   - Study the performance of the classifiers under different correlation structures and sparsity levels.

### Conclusions

- The paper provides a theoretical basis and practical tools for classification problems in high-dimensional data.
- The proposed method is not limited to L1-regularized logistic regression; it can also be extended to other regularized convex classifiers, such as support vector machines.
- Future research directions include providing more rigorous theoretical proofs and applying the method to more types of classification problems.
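To make the setting concrete, the following is a minimal sketch of a finite-size simulation in the spirit of the numerical experiments described above. It is not the authors' code. It assumes an identity covariance (the paper treats general covariance matrices), a sparse mean vector `mu` chosen purely for illustration, and a plain proximal gradient (ISTA) solver for L1-regularized logistic regression. All function names and parameter values are hypothetical, and the de-biasing step is not reproduced here.

```python
import math
import random


def simulate_mixture(n, mu, rng):
    """Draw n labelled points: y is +1 or -1 with equal probability and
    x = y * mu + z with z ~ N(0, I). Identity covariance is an assumption."""
    xs, ys = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else -1
        xs.append([y * m + rng.gauss(0.0, 1.0) for m in mu])
        ys.append(y)
    return xs, ys


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def neg_sigmoid(m):
    """Numerically stable evaluation of 1 / (1 + exp(m))."""
    if m >= 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(xs, ys, lam, step=0.1, iters=200):
    """Minimise mean logistic loss + lam * ||w||_1 by proximal gradient descent."""
    n, p = len(xs), len(xs[0])
    w = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for x, y in zip(xs, ys):
            # Gradient of log(1 + exp(-y * <w, x>)) is -y * x / (1 + exp(y * <w, x>)).
            coef = -y * neg_sigmoid(y * dot(w, x)) / n
            for j in range(p):
                grad[j] += coef * x[j]
        # Gradient step on the smooth loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w


def error_rate(w, xs, ys):
    """Fraction of points whose label disagrees with the sign of <w, x>."""
    wrong = sum(1 for x, y in zip(xs, ys) if y * dot(w, x) <= 0)
    return wrong / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    p, alpha = 30, 3                      # alpha = n / p is held fixed
    mu = [1.0] * 5 + [0.0] * (p - 5)      # illustrative sparse mean vector
    train_x, train_y = simulate_mixture(alpha * p, mu, rng)
    test_x, test_y = simulate_mixture(500, mu, rng)
    w = fit_l1_logistic(train_x, train_y, lam=0.05)
    print("estimated generalization error:", error_rate(w, test_x, test_y))
    print("nonzero coefficients:", sum(1 for wj in w if wj != 0.0))
```

The sketch uses only the standard library so it runs without dependencies. Increasing `p` while keeping `alpha` fixed is the natural way to compare such finite-size runs against asymptotic predictions, although a vectorized solver would be preferable at larger sizes.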

Statistical Inference in Classification of High-dimensional Gaussian Mixture

Gaussian mixture density modeling and decomposition with weighted likelihood

The Breakdown of Gaussian Universality in Classification of High-dimensional Mixtures

Classification of Heavy-tailed Features in High Dimensions: a Superstatistical Approach

Statistical Inference on High Dimensional Gaussian Graphical Regression Models

Classifying Overlapping Gaussian Mixtures in High Dimensions: From Optimal Classifiers to Neural Nets

Probabilistic Classifiers with a Generalized Gaussian Scale Mixture Prior

Central limit theorems for high dimensional dependent data

High dimensional gaussian classification

Mixtures of Variance-Gamma Distributions

Variational Mixtures of Gaussian Processes for Classification

Statistical Inference for High-Dimensional Generalized Linear Models With Binary Outcomes

Model-Free Statistical Inference on High-Dimensional Data

Robust Inference for High-dimensional Linear Models with Heavy-tailed Errors via Partial Gini Covariance

Estimating the mean and variance of a high-dimensional normal distribution using a mixture prior

Variable Selection for High Dimensional Gaussian Copula Regression Model: an Adaptive Hypothesis Testing Procedure

Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions

Discriminant analysis on high dimensional Gaussian copula model

Minimax Supervised Clustering in the Anisotropic Gaussian Mixture Model: A new take on Robust Interpolation

On universal inference in Gaussian mixture models

Theoretical Guarantees for Variational Inference with Fixed-Variance Mixture of Gaussians