Robust Principal Component Analysis: A Median of Means Approach

Debolina Paul, Saptarshi Chakraborty, Swagatam Das
2023-07-20
Abstract: Principal Component Analysis (PCA) is a fundamental tool for data visualization, denoising, and dimensionality reduction. It is widely popular in Statistics, Machine Learning, Computer Vision, and related fields. However, PCA is well known to fall prey to outliers and often fails to detect the true underlying low-dimensional structure within the dataset. Following the Median of Means (MoM) philosophy, recent supervised learning methods have shown great success in dealing with outlying observations without much compromise to their large-sample theoretical properties. This paper proposes a PCA procedure based on the MoM principle. Called Median of Means Principal Component Analysis (MoMPCA), the proposed method is not only computationally appealing but also achieves optimal convergence rates under minimal assumptions. In particular, we explore the non-asymptotic error bounds of the obtained solution with the aid of Rademacher complexities while making no assumptions whatsoever about the outlying observations. The derived concentration results do not depend on the dimension because the analysis is conducted in a separable Hilbert space, and the results depend only on the fourth moment of the underlying distribution in the corresponding norm. The proposal's efficacy is also thoroughly showcased through simulations and real data applications.
Machine Learning, Statistics Theory
What problem does this paper attempt to address?
The main problem this paper attempts to address is the poor performance of Principal Component Analysis (PCA) in the presence of outliers. Specifically, traditional PCA methods are susceptible to the influence of outliers, which prevents them from detecting the true low-dimensional structure in the dataset. To solve this problem, the authors propose a PCA method based on the Median of Means (MoM), called MoMPCA. This method is not only computationally efficient but also achieves optimal convergence rates under minimal assumptions. The core contributions of the paper include:
1. Proposing a simple yet efficient framework for robust PCA under the paradigm of the Median of Means.
2. Providing strong theoretical support for finite sample error rates, requiring only the assumption that the data distribution has finite fourth moments.
3. Deriving generalization bounds that are dimension-independent, meaning these error rates are equally applicable to infinite-dimensional Hilbert spaces.
4. Requiring relatively loose conditions on the number of outliers, assuming only that their number is o(N), and making no assumptions about the distribution of outliers, allowing them to be correlated, unbounded, or heavy-tailed.
5. Validating the effectiveness of MoMPCA through experiments on simulated and real datasets, demonstrating its superior performance under various experimental settings.
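To make the Median of Means idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the quantity being robustified is the squared reconstruction error of a point projected onto a candidate subspace, splits the sample into disjoint blocks, and replaces the overall mean error with the median of the per-block means. The function names, the random partitioning, and the choice of the number of blocks are all assumptions made for illustration; the paper's actual objective and optimization procedure may differ.

```python
import numpy as np


def mean_reconstruction_error(X, V):
    """Ordinary (non-robust) mean of squared reconstruction errors.

    X: (n, d) data matrix; V: (d, k) matrix with orthonormal columns.
    """
    residual = X - (X @ V) @ V.T
    return float(np.mean(np.sum(residual ** 2, axis=1)))


def mom_reconstruction_error(X, V, n_blocks, rng):
    """Median-of-means version: median over blocks of the per-block mean error.

    The blocks are a random disjoint partition of the rows of X
    (an assumption of this sketch).
    """
    n = X.shape[0]
    blocks = np.array_split(rng.permutation(n), n_blocks)
    residual = X - (X @ V) @ V.T
    sq_err = np.sum(residual ** 2, axis=1)
    block_means = np.array([sq_err[b].mean() for b in blocks])
    return float(np.median(block_means))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Clean points lie close to the line spanned by (1, 1).
    t = rng.normal(size=200)
    clean = np.outer(t, [1.0, 1.0]) + 0.01 * rng.normal(size=(200, 2))
    # A handful of gross outliers orthogonal to that line.
    outliers = np.tile([[-1000.0, 1000.0]], (5, 1))
    X = np.vstack([clean, outliers])
    V = np.array([[1.0], [1.0]]) / np.sqrt(2.0)  # true direction

    print("mean error:", mean_reconstruction_error(X, V))
    print("MoM error: ", mom_reconstruction_error(X, V, n_blocks=21, rng=rng))
```

In this toy setting the five outliers can contaminate at most five of the 21 blocks, so the median block mean is computed from clean data only, whereas the ordinary mean is dominated by the outliers. This is the general reason a median-of-means criterion tolerates a small number of arbitrary outliers: the number of blocks only needs to exceed twice the number of contaminated points.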