Classification of Heavy-tailed Features in High Dimensions: a Superstatistical Approach

Urte Adomaityte, Gabriele Sicuro, Pierpaolo Vivo
2023-11-01
Abstract: We characterise the learning of a mixture of two clouds of data points with generic centroids via empirical risk minimisation in the high dimensional regime, under the assumptions of generic convex loss and convex regularisation. Each cloud of data points is obtained via a double-stochastic process, where the sample is obtained from a Gaussian distribution whose variance is itself a random parameter sampled from a scalar distribution $\varrho$. As a result, our analysis covers a large family of data distributions, including the case of power-law-tailed distributions with no covariance, and allows us to test recent "Gaussian universality" claims. We study the generalisation performance of the obtained estimator, we analyse the role of regularisation, and we analytically characterise the separability transition.
Machine Learning, Disordered Systems and Neural Networks, Statistics Theory
What problem does this paper attempt to address?
The paper asks how Empirical Risk Minimization (ERM) performs in high-dimensional classification tasks when the data points come from a mixture model with heavy-tailed distributions. Specifically, it studies the classification of two data clouds, each generated by a Gaussian distribution with centroid \(\mu\) and random variance \(\Delta\), in the high-dimensional limit where the number of samples \(n\) and the data dimension \(d\) both tend to infinity while their ratio \(\alpha = n/d\) remains fixed.

### Main problems

1. **Influence of non-Gaussian data distributions**: High-dimensional classification research usually assumes that data points follow a Gaussian distribution or a Gaussian mixture. Real data, however, often has structure and heavy tails, and these features may strongly affect the learning process. The paper therefore examines how non-Gaussian data distributions, heavy-tailed ones in particular, affect classification performance.
2. **Role of regularization**: The paper analyzes how the regularization strength affects classification performance on non-Gaussian data and compares the results with the Gaussian case.
3. **Separability threshold**: The paper determines when the data set stops being linearly separable under non-Gaussian data distributions. This amounts to finding a critical sample complexity \(\alpha^*\), below which the data set can be perfectly linearly separated.

### Research methods

- **Superstatistical method**: The paper adopts a "superstatistical" construction: each sample is drawn from a Gaussian distribution whose variance is itself a random variable sampled from a scalar distribution. This construction covers a large class of non-Gaussian distributions, including power-law-tailed and Cauchy distributions.
- **Replica method**: The replica method, widely used in statistical physics for complex optimization problems, is used to derive the asymptotic characterization of the ERM estimator.

### Main contributions

1. **Asymptotic analysis**: The paper provides the asymptotic characterization of the ERM estimator for classification tasks on non-Gaussian mixture models in the high-dimensional limit. The results hold for covariates with infinite variance and for any convex loss function and convex regularization.
2. **Performance analysis**: The paper analyzes classification performance for different convex loss functions (such as quadratic loss and logistic loss) with ridge regularization. In particular, for two balanced non-Gaussian clusters, the optimal ridge regularization strength \(\lambda^*\) is finite, in contrast to \(\lambda^*\to\infty\) in the Gaussian case.
3. **Separability threshold**: The paper derives the separability threshold \(\alpha^*\) of the data set for a large class of non-Gaussian data distributions. This result generalizes the known asymptotic results on the separability of Gaussian clouds.
4. **Bayes-optimal performance**: Under certain moment conditions, the paper derives the Bayes-optimal performance of the binary classification task in the case of symmetric centroids.

### Experimental verification

The paper checks the theoretical predictions against numerical experiments. For different shape parameters \(a\) and sample complexities \(\alpha\), the theoretical predictions agree closely with the numerical results. In particular, classification performance under non-Gaussian data distributions differs significantly from the Gaussian case, which indicates that the "Gaussian universality principle" fails for heavy-tailed distributions.

In summary, by introducing the superstatistical method, the paper systematically studies the impact of non-Gaussian data distributions on high-dimensional classification tasks and offers a new perspective on how machine learning behaves on complex data sets.
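
As a rough illustration of the superstatistical construction and ridge-regularized ERM summarized above, the following Python sketch samples two balanced clouds and fits a quadratic-loss classifier in closed form. Several choices here are assumptions made for the example and are not taken from the text: the variance \(\Delta\) is drawn from an inverse-Gamma law with shape \(a\) and unit scale (one possible choice of variance distribution that yields power-law tails), the centroids are \(\pm\mu/\sqrt{d}\) with \(\mu\) the all-ones vector, and no bias term is fitted. The function names `sample_clouds`, `ridge_erm` and `misclassification_rate` are hypothetical.

```python
import numpy as np


def sample_clouds(n, d, a, rng, mu=None):
    """Draw n points from two balanced superstatistical clouds in d dimensions.

    Assumed model (illustrative only):
        x = y * mu / sqrt(d) + sqrt(Delta) * z,
    with y = +/-1 equiprobable, z ~ N(0, I_d), Delta ~ inverse-Gamma(a, 1).
    """
    if mu is None:
        mu = np.ones(d)  # hypothetical centroid direction
    y = rng.choice([-1.0, 1.0], size=n)
    # The reciprocal of a Gamma(a, 1) variable is inverse-Gamma(a, 1).
    delta = 1.0 / rng.gamma(shape=a, scale=1.0, size=n)
    z = rng.standard_normal((n, d))
    x = y[:, None] * mu[None, :] / np.sqrt(d) + np.sqrt(delta)[:, None] * z
    return x, y


def ridge_erm(x, y, lam):
    """Minimise 0.5 * ||y - x w||^2 + 0.5 * lam * ||w||^2 in closed form."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)


def misclassification_rate(w, x, y):
    """Fraction of points whose predicted sign differs from the label."""
    return float(np.mean(np.sign(x @ w) != y))


rng = np.random.default_rng(0)
d, alpha, a = 200, 2.0, 1.5          # dimension, sample complexity n/d, shape
x_train, y_train = sample_clouds(int(alpha * d), d, a, rng)
x_test, y_test = sample_clouds(5000, d, a, rng)

# Scan a few regularization strengths and report the test error of each fit.
for lam in (0.01, 1.0, 100.0):
    w = ridge_erm(x_train, y_train, lam)
    print(f"lambda={lam:g}  test error={misclassification_rate(w, x_test, y_test):.3f}")
```

A finite-size experiment like this one is noisy, so it can only hint at how the test error depends on \(\lambda\); the asymptotic statements in the summary refer to the limit \(n, d \to \infty\) at fixed \(\alpha\).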