Optimal rank-based testing for principal components

Marc Hallin, Davy Paindaveine, Thomas Verdebout
DOI: https://doi.org/10.1214/10-AOS810
2012-11-09
Abstract: This paper provides parametric and rank-based optimal tests for eigenvectors and eigenvalues of covariance or scatter matrices in elliptical families. The parametric tests extend the Gaussian likelihood ratio tests of Anderson (1963) and their pseudo-Gaussian robustifications by Davis (1977) and Tyler (1981, 1983). The rank-based tests address a much broader class of problems, where covariance matrices need not exist and principal components are associated with more general scatter matrices. The proposed tests are shown to outperform daily practice both from the point of view of validity as from the point of view of efficiency. This is achieved by utilizing the Le Cam theory of locally asymptotically normal experiments, in the nonstandard context, however, of a curved parametrization. The results we derive for curved experiments are of independent interest, and likely to apply in other contexts.
Statistics Theory
What problem does this paper attempt to address?
This paper addresses hypothesis testing problems in principal component analysis (PCA), specifically optimal rank-based testing for eigenvectors and eigenvalues in elliptical families. It focuses on two core problems:

1. **Hypothesis testing for the first principal direction**:
   - Test the null hypothesis \( H_{\beta_0} \): the first principal direction \( \beta_1 \) coincides, up to sign, with a specified unit vector \( \beta_0 \). The first principal direction is used only to simplify the exposition; any other principal direction could be treated in the same way.
2. **Hypothesis testing for the proportion of variance carried by a subset of principal components**:
   - Test the null hypothesis \( H_{\Lambda_0} \): the proportion of the total variance accounted for by the last \( k - q \) principal components, \( \frac{\sum_{j = q+1}^k \lambda_j}{\sum_{j = 1}^k \lambda_j} \), equals a given value \( p \), where \( p \in (0, 1) \).

### Background and Motivation

Traditional PCA inference usually relies on a Gaussian assumption, using maximum likelihood estimation (MLE) and the corresponding Wald test or Gaussian likelihood ratio test (LRT). These methods perform poorly under non-Gaussian distributions, especially with heavy-tailed distributions or outliers. The paper therefore proposes a class of rank-based tests that remain valid under any elliptical distribution and do not require any moment conditions.

### Main Contributions

1. **Proposal of rank-based tests**:
   - The paper proposes rank-based tests that are valid under any elliptical distribution and do not depend on the Gaussian assumption. In particular, the van der Waerden test (i.e., the normal-score test) is asymptotically equivalent to the optimal Gaussian LRT under the Gaussian distribution and performs better under non-Gaussian conditions.
2. **Application of the local asymptotic normality (LAN) theory**:
   - The paper uses Le Cam's LAN theory to handle the complex functional relationships between eigenvectors and eigenvalues, as well as the possibility of multiple eigenvalues. Introducing a curved parametrization resolves difficulties that the standard LAN framework cannot handle directly.
3. **Optimality results**:
   - The paper proves that the rank-based tests are locally asymptotically optimal at a specified radial density. In particular, the van der Waerden test is more efficient than the existing pseudo-Gaussian tests under non-Gaussian conditions.

### Method Overview

- **Rank-based test statistics**:
  - For the test on the first principal direction, the statistic is \( Q^{(n)}_K \):
    \[ Q^{(n)}_K := \frac{nk(k + 2)}{J_k(K)} \sum_{j = 2}^k \left(\tilde{\beta}_j' S^{(n)}_K \beta_0\right)^2 \]
  - For the test on the proportion of variance, the statistic is \( T^{(n)}_K \):
    \[ T^{(n)}_K := \left(\frac{nk(k + 2)}{J_k(K)}\right)^{1/2} \left(a_{p,q}(\tilde{\Lambda}_V)\right)^{-1/2} c_p' \, \text{dvec}\left(\tilde{\Lambda}_V^{1/2} \hat{\beta}' S^{(n)}_K \hat{\beta} \tilde{\Lambda}_V^{1/2}\right) \]
- **Curved parametrization**:
  - By introducing a curved parametrization, the paper accommodates the complex functional relationships between eigenvectors and eigenvalues, so that LAN theory can be applied to the testing problems above.
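
To make the two null hypotheses concrete, the following Python sketch computes the quantities they constrain from a known covariance matrix: the proportion of total variance carried by the last \( k - q \) principal components, and the alignment (up to sign) between the first principal direction and a candidate unit vector \( \beta_0 \). This is only an illustration of the quantities under test, not the paper's rank-based procedure: it uses a plain eigendecomposition, assumes NumPy is available, and the function names and the example matrix `sigma` are hypothetical.

```python
import numpy as np


def principal_components(scatter):
    """Return eigenvalues in descending order and the matching eigenvectors (as columns)."""
    eigvals, eigvecs = np.linalg.eigh(scatter)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder so that lambda_1 >= ... >= lambda_k
    return eigvals[order], eigvecs[:, order]


def tail_variance_proportion(eigvals, q):
    """Proportion of total variance carried by the last k - q components."""
    return float(eigvals[q:].sum() / eigvals.sum())


def alignment_with(beta_1, beta_0):
    """Absolute cosine between beta_1 and beta_0; equals 1 when they agree up to sign."""
    beta_0 = beta_0 / np.linalg.norm(beta_0)
    beta_1 = beta_1 / np.linalg.norm(beta_1)
    return abs(float(beta_1 @ beta_0))


# Hypothetical example: a diagonal covariance matrix with k = 4 and q = 2.
sigma = np.diag([4.0, 3.0, 2.0, 1.0])
lam, beta = principal_components(sigma)

prop = tail_variance_proportion(lam, q=2)                         # (2 + 1) / (4 + 3 + 2 + 1)
align = alignment_with(beta[:, 0], np.array([1.0, 0.0, 0.0, 0.0]))

print(f"proportion of variance in last k - q components: {prop:.3f}")
print(f"|cos| between beta_1 and beta_0: {align:.3f}")
```

In this toy setting the null hypothesis \( H_{\Lambda_0} \) with \( p = 0.3 \) and the null hypothesis \( H_{\beta_0} \) with \( \beta_0 = (1, 0, 0, 0)' \) both hold exactly. With data, these population quantities would be replaced by estimates, and the tests summarized above quantify whether the observed deviations are significant.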