Asymptotic properties of Principal Component Analysis and shrinkage-bias adjustment under the Generalized Spiked Population model

Rounak Dey, Seunggeun Lee

DOI: https://doi.org/10.1016/j.jmva.2019.02.007

2016-07-29

Abstract: With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigated the asymptotic behaviors of PCA under the generalized spiked population model. Based on the theoretical results, we proposed a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we showed that our methods can greatly reduce bias and improve prediction accuracy.
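
For orientation, the two models contrasted in the abstract can be written out as below. This is a sketch in notation assumed here for illustration (p features, n samples, p/n → γ, spikes λ_1, …, λ_m, limiting distribution H of the non-spiked eigenvalues). The forms are the standard ones from the random-matrix literature on spiked covariance models and are not quoted from the paper itself.

```latex
% Assumed notation: p features, n samples, p/n \to \gamma > 0, m fixed spikes.

% Classical spiked model: all non-spiked population eigenvalues are equal (scaled to 1).
\lambda_1 \ge \dots \ge \lambda_m > 1 = \lambda_{m+1} = \dots = \lambda_p

% Generalized spiked model: the non-spiked eigenvalues may differ; their
% empirical distribution converges to a limit H, and the spikes lie outside supp(H).
\frac{1}{p-m} \sum_{k=m+1}^{p} \delta_{\lambda_k} \;\Rightarrow\; H

% Almost-sure limit of the k-th sample eigenvalue d_k for a spike with \psi'(\lambda_k) > 0:
d_k \;\xrightarrow{\text{a.s.}}\; \psi(\lambda_k)
  = \lambda_k + \gamma \, \lambda_k \int \frac{t \, dH(t)}{\lambda_k - t}

% Special case H = \delta_1 (classical model): the sample eigenvalue is biased upward.
\psi(\lambda_k) = \lambda_k \left( 1 + \frac{\gamma}{\lambda_k - 1} \right) > \lambda_k
```

The last line shows why sample eigenvalues are not consistent in this regime: the limit exceeds the population value whenever γ > 0.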

Statistics Theory, Machine Learning

What problem does this paper attempt to address?

This paper studies the asymptotic properties of principal component analysis (PCA) for high-dimensional data under the generalized spiked population model. It asks how, under this model, to consistently estimate the population eigenvalues, the angles between sample and population eigenvectors, and the correlation coefficients between sample and population principal component (PC) scores, and how to adjust the shrinkage bias in the predicted PC scores. Specifically:

1. **PCA problems in high-dimensional data**: With the development of high-throughput technologies, PCA for high-dimensional data has become very important. In high-dimensional settings, however, sample eigenvalues and eigenvectors are no longer consistent estimators of their population counterparts, and the PC scores predicted from sample eigenvectors may be systematically biased towards zero.
2. **Limitations of the spiked population model**: Most existing theory and methods are based on the spiked population model, in which all population eigenvalues are equal except for a few larger ones. Because of local correlations between features, this assumption may not hold in many real-world datasets.
3. **Application of the generalized spiked population model**: To address this, the authors studied the asymptotic behavior of PCA under the generalized spiked population model. This model allows the non-spiked eigenvalues to be unequal, so it matches the characteristics of real-world datasets more closely.
4. **Proposed methods**: Based on the theoretical results, the authors proposed a series of methods to:
   - Consistently estimate population eigenvalues
   - Estimate the angles between sample and population eigenvectors
   - Estimate the correlation coefficients between sample and population PC scores
   - Adjust the shrinkage bias in the predicted PC scores
5. **Verification and application**: Through numerical experiments and real data examples from the genetics literature, the authors showed that these methods can greatly reduce bias and improve prediction accuracy.

In summary, the paper characterizes the asymptotic behavior of PCA for high-dimensional data under the generalized spiked population model and proposes methods that improve the accuracy of estimation and prediction.
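
To make the four estimation targets concrete, here is a minimal Python sketch for the classical spiked model only (non-spiked eigenvalues all equal to 1), where closed forms are well known. It is not the authors' method for the generalized model, which additionally has to account for unequal non-spiked eigenvalues. The function names and the parameter `gamma` (the limiting ratio p/n) are hypothetical names chosen for this illustration.

```python
import math


def _check_above_threshold(lam, gamma):
    # Closed forms below hold only above the phase transition lam > 1 + sqrt(gamma).
    if lam <= 1 + math.sqrt(gamma):
        raise ValueError("spike is at or below the threshold 1 + sqrt(gamma)")


def sample_eigenvalue_limit(lam, gamma):
    """Limit of the sample eigenvalue for a population spike lam (classical model)."""
    _check_above_threshold(lam, gamma)
    return lam * (1 + gamma / (lam - 1))


def estimate_population_eigenvalue(d, gamma):
    """Invert the limit: larger root of lam^2 - (d + 1 - gamma) * lam + d = 0."""
    if d <= (1 + math.sqrt(gamma)) ** 2:
        raise ValueError("sample eigenvalue lies inside the bulk; spike not identifiable")
    b = d + 1 - gamma
    return (b + math.sqrt(b * b - 4 * d)) / 2


def cos2_angle_limit(lam, gamma):
    """Limit of the squared cosine between sample and population eigenvectors."""
    _check_above_threshold(lam, gamma)
    return (1 - gamma / (lam - 1) ** 2) / (1 + gamma / (lam - 1))


def shrinkage_factor(lam, gamma):
    """Asymptotic ratio of predicted to observed PC score scale (values < 1 mean shrinkage)."""
    _check_above_threshold(lam, gamma)
    return (lam - 1) / (lam + gamma - 1)


if __name__ == "__main__":
    lam, gamma = 5.0, 1.0  # illustrative values, not taken from the paper
    d = sample_eigenvalue_limit(lam, gamma)
    print(d, estimate_population_eigenvalue(d, gamma))
    print(cos2_angle_limit(lam, gamma), shrinkage_factor(lam, gamma))
```

With the illustrative values λ = 5 and γ = 1, the sample eigenvalue limit is 6.25, inverting it recovers 5, the squared cosine is 0.75, and the shrinkage factor is 0.8. Predicted PC scores would then be divided by the shrinkage factor to undo the bias towards zero.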

High-Dimensional PCA Revisited: Insights from General Spiked Models and Data Normalization Effects

On the asymptotic properties of product-PCA under the high-dimensional setting

The High-Dimensional Asymptotics of Principal Component Regression

Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices

Optimal Spectral Shrinkage and PCA With Heteroscedastic Noise

The Asymptotic Properties of the Extreme Eigenvectors of High-dimensional Generalized Spiked Covariance Model

Statistical inference for principal components of spiked covariance matrices

On spiked eigenvalues of a renormalized sample covariance matrix from multi-population

Robust Covariance Estimation for Distributed Principal Component Analysis

Bayes-optimal limits in structured PCA, and how to reach them

Long-term followup of spinal cord injury patients managed by intermittent catheterization.

Asymptotic theory of principal component analysis for time series data with cautionary comments

Power Analysis of Principal Components Regression in Genetic Association Studies.

Debiasing Sample Loadings and Scores in Exponential Family PCA for Sparse Count Data

Optimal Eigenvalue Shrinkage in the Semicircle Limit

Dynamic Principal Subspaces with Sparsity in High Dimensions

When and why are principal component scores a good tool for visualizing high-dimensional data?

Generalized probabilistic principal component analysis of correlated data

On General Adaptive Sparse Principal Component Analysis

An augmented Lagrangian approach for sparse principal component analysis