Direction-Projection-Permutation for High Dimensional Hypothesis Tests

Susan Wei, Chihoon Lee, Lindsay Wichers, Gen Li, J.S. Marron
DOI: https://doi.org/10.48550/arXiv.1304.0796
2013-04-03
Abstract: Motivated by the prevalence of high dimensional low sample size datasets in modern statistical applications, we propose a general nonparametric framework, Direction-Projection-Permutation (DiProPerm), for testing high dimensional hypotheses. The method is aimed at rigorous testing of whether lower dimensional visual differences are statistically significant. Theoretical analysis under the non-classical asymptotic regime of dimension going to infinity for fixed sample size reveals that certain natural variations of DiProPerm can have very different behaviors. An empirical power study both confirms the theoretical results and suggests DiProPerm is a powerful test in many settings. Finally, DiProPerm is applied to a high dimensional gene expression dataset.
Methodology
What problem does this paper attempt to address?
This paper addresses hypothesis testing for High Dimensional Low Sample Size (HDLSS) datasets. It proposes a nonparametric framework, Direction-Projection-Permutation (DiProPerm), for testing high-dimensional hypotheses. Specifically, the paper addresses the following problems:
1. **Testing the equality of high-dimensional distributions**:
   - The framework rigorously tests whether differences that are visible in low-dimensional views of the data are statistically significant.
   - One main objective is to test whether two high-dimensional distributions are equal:
     \[ H_0: F_1 = F_2 \quad \text{vs} \quad H_1: F_1 \neq F_2 \]
     where \( F_1 \) and \( F_2 \) are the distributions underlying the two samples.
2. **Testing the equality of means**:
   - Another objective is to test whether the means of two high-dimensional distributions are equal:
     \[ H_0: \mu(F_1) = \mu(F_2) \quad \text{vs} \quad H_1: \mu(F_1) \neq \mu(F_2) \]
3. **Dealing with over-fitting in HDLSS datasets**:
   - In very high dimensions, many linear classifiers are prone to over-fitting, so an apparent separation between two groups may not reflect a real difference. DiProPerm addresses this by assessing whether the difference seen in a one-dimensional projection is statistically significant.
4. **Providing an effective testing method**:
   - Theoretical analysis (dimension going to infinity for fixed sample size) and an empirical power study suggest that DiProPerm is a powerful test in many settings. The paper also applies it to a high-dimensional gene expression dataset.
In summary, the paper aims to provide an effective nonparametric hypothesis-testing framework for HDLSS datasets, one that tests equality of high-dimensional distributions and of their means while guarding against the over-fitting that high-dimensional data invites.
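The three steps named in the method (direction, projection, permutation) can be illustrated with a minimal Python sketch. It rests on several assumptions that the summary above does not specify: the direction is the normalized difference of sample means, the test statistic is the difference of the projected group means, and the direction is recomputed for every permutation of the group labels. The function name `diproperm_mean_diff` and its parameters are hypothetical, and other choices of direction and statistic are possible within the framework.

```python
import numpy as np


def diproperm_mean_diff(x, y, n_perm=1000, seed=0):
    """Sketch of a direction-projection-permutation test.

    x, y: arrays of shape (n1, d) and (n2, d), one row per observation.
    Returns the observed statistic and a one-sided permutation p-value.
    Assumed choices (not taken from the summary): mean-difference
    direction and difference of projected means as the statistic.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n1 = x.shape[0]

    def statistic(a, b):
        # Direction: unit vector along the difference of sample means.
        direction = a.mean(axis=0) - b.mean(axis=0)
        norm = np.linalg.norm(direction)
        if norm == 0.0:
            return 0.0
        direction = direction / norm
        # Projection: reduce each group to one-dimensional scores,
        # then summarize the separation of the two score sets.
        return float((a @ direction).mean() - (b @ direction).mean())

    observed = statistic(x, y)

    # Permutation: relabel the pooled sample and repeat both steps,
    # so the null distribution reflects the same fitting procedure.
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        perm_stats[i] = statistic(pooled[idx[:n1]], pooled[idx[n1:]])

    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(10, 200))        # 10 samples, 200 dimensions
    y = rng.normal(size=(10, 200)) + 1.0  # same shape, shifted mean
    stat, p = diproperm_mean_diff(x, y)
    print(f"statistic={stat:.3f}, p-value={p:.4f}")
```

Recomputing the direction inside each permutation is the point of the design: the null distribution then includes whatever separation the direction-finding step can produce by chance, which is how a procedure of this kind guards against over-fitting in the HDLSS setting.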