Abstract: With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigated the asymptotic behaviors of PCA under the generalized spiked population model. Based on the theoretical results, we proposed a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we showed that our methods can greatly reduce bias and improve prediction accuracy.
What problem does this paper attempt to address?
This paper addresses two related problems. The first is to characterize the asymptotic properties of principal component analysis (PCA) for high-dimensional data under the generalized spiked population model. The second is to use those results to consistently estimate the population eigenvalues, the angles between sample and population eigenvectors, and the correlation coefficients between sample and population principal component (PC) scores, and to adjust the shrinkage bias in predicted PC scores. Specifically:
1. **PCA problems in high-dimensional data**: With the development of high-throughput technologies, PCA for high-dimensional data has become very important. However, in high-dimensional settings, sample eigenvalues and eigenvectors are no longer consistent estimators of population eigenvalues and eigenvectors, and the principal component scores predicted based on sample eigenvectors may be systematically biased towards zero.
2. **Limitations of the spiked population model**: Most existing theory and methods assume the spiked population model, in which all population eigenvalues are equal except for a few large ones. However, because of local correlation among features, this assumption may not hold in many real-world datasets.
3. **Application of the generalized spiked population model**: To address this limitation, the authors studied the asymptotic behavior of PCA under the generalized spiked population model. This model allows the non-spiked eigenvalues to be unequal, so it better matches the characteristics of real-world datasets.
4. **Proposed methods**: Based on the theoretical results, the authors proposed a series of methods to:
- Consistently estimate population eigenvalues
- Estimate the angles between sample and population eigenvectors
- Estimate the correlation coefficients between sample and population principal component scores
- Adjust the shrinkage bias in the predicted principal component scores
5. **Verification and application**: Using numerical experiments and real data examples from the genetics literature, the authors showed that these methods can greatly reduce bias and improve prediction accuracy.
In summary, this paper characterizes the asymptotic behavior of PCA for high-dimensional data under the generalized spiked population model and proposes methods that improve the accuracy of estimation and prediction.
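The high-dimensional effects described in point 1 can be illustrated with a small simulation. The sketch below is not the authors' method: it assumes the classical spiked model (identity covariance with a single spiked eigenvalue), Gaussian data, and a known zero mean. The function name `simulate_spiked_pca` and all parameter values are hypothetical choices made for illustration. It compares the leading sample eigenvalue with the population value, measures the angle between the sample and population eigenvectors, and compares the variance of PC scores on training data with that of scores predicted for new samples.

```python
import numpy as np


def simulate_spiked_pca(n=200, p=4000, spike=20.0, n_test=200, seed=0):
    """Illustrate PCA bias when p >> n under a one-spike population model.

    Assumption: the population covariance is the identity except for one
    eigenvalue equal to `spike` along the first coordinate axis.
    """
    rng = np.random.default_rng(seed)

    # Per-feature standard deviations: sqrt(spike) for feature 0, 1 elsewhere.
    sd = np.ones(p)
    sd[0] = np.sqrt(spike)

    # Independent training and test samples from the same population.
    x_train = rng.standard_normal((n, p)) * sd
    x_test = rng.standard_normal((n_test, p)) * sd

    # Sample PCA via SVD; the population mean is zero, so no centering is done.
    _, s, vt = np.linalg.svd(x_train, full_matrices=False)
    v1 = vt[0]                    # leading sample eigenvector (unit norm)
    sample_eig = s[0] ** 2 / n    # leading sample eigenvalue

    # PC scores: projections of the samples onto the sample eigenvector.
    train_scores = x_train @ v1
    test_scores = x_test @ v1

    return {
        "population_eigenvalue": float(spike),
        "sample_eigenvalue": float(sample_eig),
        # The population eigenvector is the first axis, so the cosine of the
        # angle is the absolute first coordinate of v1.
        "cos_angle": float(abs(v1[0])),
        "train_score_var": float(train_scores.var()),
        "test_score_var": float(test_scores.var()),
    }


if __name__ == "__main__":
    for name, value in simulate_spiked_pca().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the leading sample eigenvalue should overestimate the population eigenvalue, the sample eigenvector should be visibly rotated away from the population eigenvector, and the predicted scores for new samples should have smaller variance than the training scores. The last effect is the shrinkage towards zero that the bias adjustment targets. Under the generalized spiked model the non-spiked eigenvalues would be unequal, which this sketch does not cover.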