High-Dimensional PCA Revisited: Insights from General Spiked Models and Data Normalization Effects

Yanqing Yin, Wang Zhou
2024-08-25
Abstract: Principal Component Analysis (PCA) is a critical tool for dimensionality reduction and data analysis. This paper revisits PCA through the lens of generalized spiked covariance and correlation models, which allow for more realistic and complex data structures. We explore the asymptotic properties of the sample principal components (PCs) derived from both the sample covariance and correlation matrices, focusing on how data normalization, an essential step for scale-invariant analysis, affects these properties. Our results reveal that while normalization does not alter the first-order limits of spiked eigenvalues and eigenvectors, it significantly influences their second-order behavior. We establish new theoretical findings, including a joint central limit theorem for bilinear forms of the sample covariance matrix's resolvent and diagonal entries, providing a robust framework for understanding spiked models in high dimensions. Our theoretical results also reveal an intriguing phenomenon regarding the effect of data normalization when the variances of covariates are equal. Specifically, they suggest that high-dimensional PCA based on the correlation matrix may not only perform comparably to, but potentially even outperform, PCA based on the covariance matrix, particularly when the leading principal component is sufficiently large. This study not only extends the existing literature on spiked models but also offers practical guidance for applying PCA in real-world scenarios, particularly when dealing with normalized data.
Statistics Theory
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "High-Dimensional PCA Revisited: Insights from General Spiked Models and Data Normalization Effects" aims to solve the following problems:

1. **Theoretical basis of high-dimensional principal component analysis (PCA)**:
   - Study the asymptotic properties of the sample principal components (PCs) for high-dimensional data, especially within the framework of the generalized spiked model.
   - Explore the joint asymptotic behavior of the sample PCs when the non-spiked part deviates from the identity matrix, in order to better understand how PCA performs in complex real-world scenarios.
2. **The impact of data standardization on PCA**:
   - Analyze how data standardization affects the first-order and second-order properties of the sample PCs, especially in high-dimensional settings.
   - Provide theoretical guidance for applying PCA to standardized data, so that it is more effective and reliable on high-dimensional data.

### Specific problems and solutions

1. **Theoretical challenges of high-dimensional PCA**:
   - As the dimension grows, the eigenvalues of the sample covariance matrix \( S \) deviate more and more from those of the population covariance matrix \( \Sigma \), which makes a direct application of PCA unreliable.
   - The generalized spiked model assumes that a few eigenvalues of the population covariance matrix are clearly separated from the rest. The paper studies the asymptotic properties of the corresponding sample eigenvalues and eigenvectors.
2. **The impact of data standardization**:
   - Data standardization is a common step in multivariate analysis that removes the influence of variable scales. Its effect on the performance and accuracy of PCA, however, still needs further study.
   - Through rigorous theoretical analysis, the paper shows that standardization leaves the first-order limits of the spiked eigenvalues and eigenvectors unchanged but significantly affects their second-order behavior. In particular, when the variances of the covariates are equal, PCA based on the correlation matrix may perform comparably to, or even better than, PCA based on the covariance matrix, particularly when the leading principal component is sufficiently large.

### Main contributions

1. **A new joint central limit theorem (CLT)**:
   - Establish a joint CLT for bilinear forms of the resolvent of the sample covariance matrix and its diagonal entries, which simplifies the derivation of the key theoretical results.
2. **Comprehensive asymptotic analysis of the generalized spiked model**:
   - Extend the existing theoretical framework to a wider range of spiked covariance models, in which the non-spiked part is not restricted to the identity matrix.
   - Derive the asymptotic distribution of the projection of a sample spiked eigenvector onto an arbitrary direction, giving a complete picture of the behavior of the sample PCs.
3. **Theoretical progress in scale-invariant PCA**:
   - For standardized data, provide the joint limiting distribution of the spiked eigenvalues and the asymptotic distribution of the projection of a sample spiked eigenvector onto an arbitrary direction.
   - Go beyond existing work by allowing the non-spiked part to be arbitrary and the spiked and non-spiked parts to be dependent, which makes the model more flexible and realistic.

### Conclusion

Through in-depth theoretical analysis, the paper extends the theoretical basis of high-dimensional PCA and offers practical guidance for applications, especially for PCA on standardized data.
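
### Illustrative example

The comparison between covariance-based and correlation-based PCA summarized above can be explored numerically. The following is a minimal sketch, not the authors' experiment: the sample size `n`, dimension `p`, spike strength `spike`, the Gaussian data, and the choice of a single spike direction with equal entries (which makes all covariate variances equal) are assumptions made only for illustration. The function names `leading_pc` and `compare_cov_and_corr_pca` are hypothetical.

```python
import numpy as np


def leading_pc(matrix):
    """Return the largest eigenvalue and its unit eigenvector of a symmetric matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)  # eigenvalues in ascending order
    return eigenvalues[-1], eigenvectors[:, -1]


def compare_cov_and_corr_pca(n=400, p=200, spike=10.0, seed=0):
    """Simulate one data set from a single-spike model and run both PCA variants.

    Assumed population covariance (illustrative only):
        Sigma = I + (spike - 1) * v v^T,  with v = (1, ..., 1) / sqrt(p).
    Every diagonal entry of Sigma equals 1 + (spike - 1) / p, so all covariates
    share the same variance and v is also the leading eigenvector of the
    population correlation matrix.
    """
    rng = np.random.default_rng(seed)
    v = np.ones(p) / np.sqrt(p)

    # Symmetric square root of Sigma: (I + a v v^T)^2 = I + (2a + a^2) v v^T,
    # and a = sqrt(spike) - 1 gives 2a + a^2 = spike - 1.
    root = np.eye(p) + (np.sqrt(spike) - 1.0) * np.outer(v, v)
    x = rng.standard_normal((n, p)) @ root  # rows are observations

    sample_cov = np.cov(x, rowvar=False)        # PCA on unnormalized data
    sample_corr = np.corrcoef(x, rowvar=False)  # PCA on standardized data

    cov_value, cov_vector = leading_pc(sample_cov)
    corr_value, corr_vector = leading_pc(sample_corr)

    # Squared inner product with the true direction; the square removes the
    # sign ambiguity of eigenvectors.
    return {
        "cov_eigenvalue": float(cov_value),
        "corr_eigenvalue": float(corr_value),
        "cov_alignment": float(np.dot(cov_vector, v) ** 2),
        "corr_alignment": float(np.dot(corr_vector, v) ** 2),
    }


if __name__ == "__main__":
    for name, value in compare_cov_and_corr_pca().items():
        print(f"{name}: {value:.4f}")
```

A single run only shows that both variants recover the spike direction in this toy setting. Examining the second-order effects discussed in the paper would require repeating the simulation over many seeds and comparing the fluctuations of the two alignment values.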