Abstract: This paper is concerned with estimating the column subspace of a low-rank matrix $\boldsymbol{X}^\star \in \mathbb{R}^{n_1\times n_2}$ from contaminated data. How to obtain optimal statistical accuracy while accommodating the widest range of signal-to-noise ratios (SNRs) becomes particularly challenging in the presence of heteroskedastic noise and unbalanced dimensionality (i.e., $n_2\gg n_1$). While the state-of-the-art algorithm $\textsf{HeteroPCA}$ emerges as a powerful solution for solving this problem, it suffers from "the curse of ill-conditioning," namely, its performance degrades as the condition number of $\boldsymbol{X}^\star$ grows. In order to overcome this critical issue without compromising the range of allowable SNRs, we propose a novel algorithm, called $\textsf{Deflated-HeteroPCA}$, that achieves near-optimal and condition-number-free theoretical guarantees in terms of both $\ell_2$ and $\ell_{2,\infty}$ statistical accuracy. The proposed algorithm divides the spectrum of $\boldsymbol{X}^\star$ into well-conditioned and mutually well-separated subblocks, and applies $\textsf{HeteroPCA}$ to conquer each subblock successively. Further, an application of our algorithm and theory to two canonical examples -- the factor model and tensor PCA -- leads to remarkable improvement for each application.
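
For reference, two quantities the abstract relies on can be written out explicitly. This is a sketch using standard definitions; the symbols $r$ (the rank of $\boldsymbol{X}^\star$), $\sigma_i$ (the $i$-th largest singular value), and $\kappa$ are notational assumptions that the abstract itself does not fix.

$$
\kappa \;=\; \frac{\sigma_1(\boldsymbol{X}^\star)}{\sigma_r(\boldsymbol{X}^\star)},
\qquad
\|\boldsymbol{A}\|_{2,\infty} \;=\; \max_{1\le i\le n_1} \|\boldsymbol{A}_{i,:}\|_2 .
$$

A "condition-number-free" guarantee is then one whose error bound does not grow with $\kappa$, and an $\ell_{2,\infty}$ guarantee controls the largest row-wise error rather than only the overall spectral error.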

What problem does this paper attempt to address?

The main problem this paper addresses is how to accurately estimate the column subspace of a low-rank matrix $X^\star\in\mathbb{R}^{n_1\times n_2}$ in the presence of heteroscedastic noise and unbalanced dimensions. Specifically, the authors focus on overcoming "the curse of ill-conditioning": when the condition number of $X^\star$ is large, the performance of existing algorithms such as HeteroPCA degrades significantly.

### Problem Background

In high-dimensional data analysis, principal component analysis (PCA) is a commonly used technique for identifying low-dimensional subspaces that best capture the information in a large number of high-dimensional data points. However, existing PCA methods face challenges when the noise is heteroscedastic, especially when the dimensions are unbalanced (i.e., $n_2\gg n_1$). In this case, the signal-to-noise ratio (SNR) may be very low, causing traditional low-rank matrix estimation algorithms to fail.

### Specific Problems

1. **Heteroscedastic Noise and Unbalanced Dimensions**: When the noise variance changes with position and the dimensions are highly unbalanced, traditional PCA methods struggle to produce reliable subspace estimates.
2. **The Curse of Ill-Conditioning**: Even in the absence of noise, the existing HeteroPCA algorithm can fail when the condition number of $X^\star$ is large. This is because, when the condition number is large, the bias introduced by the diagonal-deletion step cannot be effectively eliminated.

### Solutions

To solve the above problems, the authors propose a new algorithm, Deflated-HeteroPCA. It overcomes "the curse of ill-conditioning" in the following ways:

- **Block Processing**: Divide the spectrum of $X^\star$ into several sub-blocks, each with a small condition number and well separated from the others, and apply HeteroPCA to each sub-block in turn.
- **Gradually Eliminating Bias**: Through successive "deflation," gradually reduce the bias introduced by the diagonal-deletion step, thereby improving the accuracy of the estimate.

### Theoretical Guarantees

The authors provide rigorous theoretical guarantees for Deflated-HeteroPCA, including:

- **Spectral Norm Error**: Under the spectral norm, the estimation error of Deflated-HeteroPCA does not depend on the condition number and achieves near-optimal statistical performance.
- **Fine-Grained $\ell_{2,\infty}$ Norm Error**: Under the more refined $\ell_{2,\infty}$ norm, the estimation error of Deflated-HeteroPCA is likewise near-optimal and does not depend on the condition number.

### Application Examples

The authors also show the advantages of Deflated-HeteroPCA in two canonical applications:

- **Factor Model**: In the factor model, Deflated-HeteroPCA yields estimates whose accuracy is independent of the condition number.
- **Tensor PCA**: Combined with the HOOI algorithm, Deflated-HeteroPCA also improves performance in tensor PCA.

In summary, this paper proposes the Deflated-HeteroPCA algorithm to overcome the "curse of ill-conditioning" that existing PCA methods encounter with heteroscedastic noise and unbalanced dimensions, supports it with rigorous theoretical guarantees, and demonstrates its benefits in two applications.
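
The block processing and gradual bias elimination described under Solutions can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the function names, the iteration count `n_iter`, and the rank-selection rule in `choose_rank` (including the placeholder threshold `cond_max=4.0` and the gap test) are hypothetical choices made for this example, and the sketch assumes the target rank `r` is smaller than $n_1$.

```python
import numpy as np


def top_r_approx(G, r):
    """Best rank-r approximation of a symmetric matrix G, plus its leading eigenvectors."""
    vals, vecs = np.linalg.eigh(G)
    idx = np.argsort(-np.abs(vals))[:r]      # keep the r eigenvalues largest in magnitude
    U = vecs[:, idx]
    return (U * vals[idx]) @ U.T, U


def hetero_pca(G, r, n_iter=50):
    """HeteroPCA-style iteration: keep off-diagonal entries fixed and
    repeatedly impute the diagonal from the current rank-r approximation."""
    G = G.copy()
    for _ in range(n_iter):
        G_r, _ = top_r_approx(G, r)
        np.fill_diagonal(G, np.diag(G_r))    # only the diagonal is overwritten
    _, U = top_r_approx(G, r)
    return G, U


def choose_rank(G, r_prev, r, cond_max=4.0):
    """Hypothetical rule: pick the largest rank k in (r_prev, r] whose block
    sigma_{r_prev+1}..sigma_k is well-conditioned and separated from sigma_{k+1}."""
    s = np.linalg.svd(G, compute_uv=False)   # singular values, descending
    candidates = [
        k for k in range(r_prev + 1, r + 1)
        if s[r_prev] <= cond_max * s[k - 1]      # block condition number is small
        and s[k - 1] - s[k] >= s[k - 1] / r      # gap to the next singular value
    ]
    return max(candidates) if candidates else r


def deflated_hetero_pca(Y, r, n_iter=50):
    """Estimate the rank-r column subspace of Y (n1 x n2) block by block."""
    G = Y @ Y.T
    np.fill_diagonal(G, 0.0)                 # diagonal deletion
    r_k, U = 0, None
    while r_k < r:
        r_k = choose_rank(G, r_k, r)         # strictly increases, so the loop terminates
        G, U = hetero_pca(G, r_k, n_iter)    # warm-start from the previous round's G
    return U
```

For an $n_1\times n_2$ data matrix `Y`, a call such as `U_hat = deflated_hetero_pca(Y, r=3)` returns an $n_1\times 3$ matrix with orthonormal columns. Each round reuses the diagonal imputed by the previous round, which is how the sketch mirrors the idea of handling one well-conditioned sub-block of the spectrum at a time.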

Deflated HeteroPCA: Overcoming the curse of ill-conditioning in heteroskedastic PCA

Expectile regression for analyzing heteroscedasticity in high dimension

Inference for Heteroskedastic PCA with Missing Data

Towards a Theoretical Analysis of PCA for Heteroscedastic Data

HePPCAT: Probabilistic PCA for Data with Heteroscedastic Noise

Improved Algorithms for High-Dimensional Robust PCA

Heteroskedastic Tensor Clustering

ALPCAH: Sample-wise Heteroscedastic PCA with Tail Singular Value Regularization

Optimal Spectral Shrinkage and PCA With Heteroscedastic Noise

Streaming Probabilistic PCA for Missing Data with Heteroscedastic Noise

Sparse sufficient dimension reduction with heteroscedasticity

Normalized Robust PCA With Adaptive Reconstruction Error Minimization

Robust PCA via Outlier Pursuit

Robust PCA as Bilinear Decomposition with Outlier-Sparsity Regularization

Diagonally-Dominant Principal Component Analysis

On the Estimation Performance of Generalized Power Method for Heteroscedastic Probabilistic PCA

High-Dimensional Principal Component Analysis with Heterogeneous Missingness

Invariant subspaces and PCA in nearly matrix multiplication time

Uniform error bound for PCA matrix denoising

Learning Feature Sparse Principal Subspace

Recovering PCA from Hybrid-$(\ell_1,\ell_2)$ Sparse Sampling of Data Elements