Augmented Estimation of Principal Component Subspace in High Dimensions

Dongsun Yoon, Sungkyu Jung
2024-11-25
Abstract: In this paper, we introduce a novel estimator, called the Augmented Principal Component Subspace, for estimating the principal component subspace of high-dimensional, low-sample-size data with a spiked covariance structure. Our approach augments the naive sample principal component subspace by incorporating additional information from predefined reference directions. The augmented principal component subspace asymptotically reduces every principal angle between the estimated and the true subspaces, thereby outperforming the naive estimator regardless of the metric used. The estimator's efficiency is validated both analytically and through numerical studies, demonstrating significant improvements in accuracy when the reference directions contain substantial information about the true principal component subspace. Additionally, we suggest Augmented PCA using this estimator and explore connections between our method and the recently proposed James-Stein estimator for principal component directions.
Statistics Theory
What problem does this paper attempt to address?
This paper addresses how to estimate the principal component subspace more accurately for high-dimensional, low-sample-size data. The authors propose a new estimator, the Augmented Principal Component Subspace, which improves on the naive sample principal component subspace by incorporating predefined reference directions.

### Background and Problem Description

In high-dimensional data, when the sample size \(n\) is much smaller than the feature dimension \(p\), traditional principal component analysis (PCA) may perform poorly. In particular, when the data have a spiked covariance structure, that is, when the variances of the first few principal components are much larger than those of the remaining components, the sample principal component directions are often inaccurate estimates of the true directions.

### Main Contributions of the Paper

1. **Introduction of the Augmented Principal Component Subspace**:
   - The authors propose the Augmented Principal Component Subspace, which augments the sample principal component subspace with predefined reference directions.
   - When these reference directions carry information about the true principal component subspace, they improve the accuracy of the estimate.
2. **Theoretical Analysis**:
   - The authors prove that, for high-dimensional, low-sample-size data, the augmented estimator asymptotically reduces every principal angle between the estimated and the true subspaces, and therefore outperforms the naive estimator.
   - Because every principal angle is reduced, the improvement holds regardless of the metric used to compare subspaces.
3. **Numerical Verification**:
   - The authors assess the augmented estimator through numerical experiments.
   - The results show that accuracy improves significantly when the reference directions contain substantial information about the true principal component subspace.

### Method Overview

1. **Data Setup and Assumptions**:
   - Assume that the data \(X_1,\ldots,X_n\in\mathbb{R}^p\) are drawn from an absolutely continuous distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\).
   - The eigendecomposition of the covariance matrix is \(\Sigma = U\Lambda U^\top=\sum_{i = 1}^p\lambda_i u_i u_i^\top\), where \(U = [u_1,\ldots,u_p]\) is the matrix of eigenvectors and \(\Lambda\) is a diagonal matrix whose diagonal elements are the eigenvalues \(\lambda_1\geq\cdots\geq\lambda_p\), arranged in descending order.
2. **Augmented Principal Component Subspace**:
   - Define the signal subspace \(S=\text{span}(\hat{u}_1,\ldots,\hat{u}_m,\nu_1,\ldots,\nu_L)\), where \(\hat{u}_1,\ldots,\hat{u}_m\) are the sample principal component directions and \(\nu_1,\ldots,\nu_L\) are the predefined reference directions.
   - Estimate the optimal subspace within the signal subspace using negatively ridged discriminant-type vectors.
3. **Algorithm Implementation**:
   - Propose an Augmented PCA algorithm, which performs principal component analysis using the more accurate augmented principal component subspace.

### Conclusion

The paper tackles the inaccurate estimation of the principal component subspace in high-dimensional, low-sample-size data by introducing the Augmented Principal Component Subspace estimator. Theoretical analysis and numerical experiments both support the effectiveness of the method.
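The setup in the Method Overview can be illustrated with a small simulation. The sketch below is not the authors' estimator: it does not implement the negatively ridged discriminant-type vectors. It only builds the signal subspace \(S\) from the sample principal component directions plus one reference direction, and compares the naive subspace with an oracle subspace, namely the projection of the true subspace onto \(S\), which requires knowing the truth. All concrete choices here are assumptions made for illustration: Gaussian data with \(\mu = 0\), spike sizes proportional to \(p\), a single hypothetical reference direction `nu`, and the function names `principal_angles` and `simulate`.

```python
import numpy as np


def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Cosines of the principal angles are the singular values of Qa^T Qb.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


def simulate(p=2000, n=40, m=2, seed=0):
    """Return (naive_angles, oracle_angles) for one simulated data set."""
    rng = np.random.default_rng(seed)

    # True leading eigenvectors u_1..u_m: random orthonormal columns.
    U_m, _ = np.linalg.qr(rng.standard_normal((p, m)))

    # Assumed spiked model: Sigma = U_m diag(spikes) U_m^T + I_p, mean zero.
    spikes = p * np.linspace(1.0, 0.5, m)
    scores = rng.standard_normal((n, m)) * np.sqrt(spikes)
    X = scores @ U_m.T + rng.standard_normal((n, p))

    # Naive estimator: top-m right singular vectors of the centered data.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_hat = Vt[:m].T

    # Hypothetical reference direction: a perturbed vector from the true
    # subspace, so that it carries some information about it.
    nu = U_m.sum(axis=1) + 0.5 * rng.standard_normal(p) / np.sqrt(p)
    nu = (nu / np.linalg.norm(nu))[:, None]

    # Orthonormal basis of the signal subspace S = span(u_hat_1..u_hat_m, nu).
    Q_S, _ = np.linalg.qr(np.hstack([U_hat, nu]))

    # Oracle (uses the truth, so it is NOT an estimator): the m-dimensional
    # subspace of S spanned by the projection of U_m onto S.
    W, _, _ = np.linalg.svd(Q_S.T @ U_m, full_matrices=False)
    U_oracle = Q_S @ W

    return principal_angles(U_hat, U_m), principal_angles(U_oracle, U_m)


if __name__ == "__main__":
    naive, oracle = simulate()
    print("naive  principal angles (deg):", np.degrees(naive))
    print("oracle principal angles (deg):", np.degrees(oracle))
```

Since the naive subspace is itself contained in \(S\), each oracle principal angle is no larger than the corresponding naive one. This is only the headroom available inside \(S\); how closely a data-driven estimator approaches it is what the paper's analysis addresses.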