Abstract: A century ago, when Student's t-statistic was introduced, no one imagined how widely it would be applied in the modern era. It is now used in large-scale multiple hypothesis testing, feature selection and ranking, high-dimensional signal detection, and related problems. Student's t-statistic is constructed from the empirical distribution function (EDF). An alternative to the EDF is the kernel density estimate (KDE), which is a smoothed version of the EDF. The novelty of this work is an alternative to Student's t-test based on the KDE, together with an exploration of the usefulness of the KDE-based t-test in large-scale simultaneous hypothesis testing. An optimal bandwidth parameter for the KDE approach is derived by minimizing the asymptotic error between the true p-value and its asymptotic estimate based on the normal approximation. When the KDE-based approach is used for large-scale simultaneous testing, a natural question is when the method fails to control the error rate. We show that the proposed KDE-based method controls the false discovery rate (FDR) if the total number of tests diverges at a smaller order of magnitude than $N^{3/2}$, where $N$ is the total sample size. We compare our method with several alternatives with respect to FDR. In simulations, our method produces a lower proportion of false discoveries than its competitors; that is, it controls the FDR better. These empirical studies indicate that the proposed method can be applied successfully in practice. The usefulness of the proposed method is further illustrated with a gene expression data example.
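The pipeline described in the abstract (a KDE-based t-type statistic, p-values from the normal approximation, then FDR control across many tests) can be illustrated with a minimal sketch. This is not the paper's procedure: the abstract gives neither the exact form of the statistic nor the optimal bandwidth. The sketch assumes a Gaussian kernel, for which the KDE has the same mean as the sample and variance equal to the empirical variance plus $h^2$; it treats the bandwidth `h` as a user-supplied value; and it uses the standard Benjamini-Hochberg step-up rule as a stand-in for the FDR step. The function names `kde_t_statistic`, `normal_two_sided_p`, and `benjamini_hochberg` are hypothetical.

```python
import math
import numpy as np


def kde_t_statistic(x, h):
    """One-sample t-type statistic built from a Gaussian-kernel KDE.

    Assumption (not taken from the paper): with a Gaussian kernel of
    bandwidth h, the KDE has the sample mean as its mean and
    (1/n) * sum((x - mean)^2) + h^2 as its variance.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    var_kde = x.var() + h ** 2  # np.var uses ddof=0 by default
    return math.sqrt(n) * x.mean() / math.sqrt(var_kde)


def normal_two_sided_p(t):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def benjamini_hochberg(pvals, alpha=0.05):
    """Standard Benjamini-Hochberg step-up rule; returns a rejection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 20))  # 1000 tests, 20 observations each
    data[:50] += 1.5                    # first 50 rows carry a true signal
    h = 0.1                             # hypothetical bandwidth, not the optimal one
    pvals = [normal_two_sided_p(kde_t_statistic(row, h)) for row in data]
    rejected = benjamini_hochberg(pvals, alpha=0.05)
    print("rejections:", int(rejected.sum()))
```

Setting `h = 0` reduces the statistic to the usual t-type ratio with the uncorrected variance, which makes the role of the bandwidth easy to see: a larger `h` inflates the variance estimate and shrinks the statistic toward zero.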

Simultaneous critical values for $t$-tests in very high dimensions

Efficient three-stage $t$-tests

Tests for a Multiple-Sample Problem in High Dimensions

Finite sample t-tests for high-dimensional means

Large-Scale Simultaneous Testing Using Kernel Density Estimation

Asymptotic Uncertainty of False Discovery Proportion for Dependent $t$-Tests

Simulation-Based Hypothesis Testing of High Dimensional Means Under Covariance Heterogeneity

Incorporation of Sparsity Information in Large-scale Multiple Two-sample $t$ Tests

Optimal False Discovery Rate Control for Large Scale Multiple Testing with Auxiliary Information

Generalizing Simes' test and Hochberg's stepup procedure

A More Powerful Two-Sample Test in High Dimensions using Random Projection

Testing and Support Recovery of Multiple High-Dimensional Covariance Matrices with False Discovery Rate Control

A Unified Framework for Testing High Dimensional Parameters: A Data-Adaptive Approach

Bayesian Optimal Two-sample Tests in High-dimension

Multiple two-sample testing under arbitrary covariance dependency with an application in imaging mass spectrometry

Consistent estimation of the proportion of false nulls and FDR for adaptive multiple testing Normal means under weak dependence

An adaptable generalization of Hotelling's $T^2$ test in high dimension

Analysis of error control in large scale two-stage multiple hypothesis testing

Optimal exact tests for multiple binary endpoints

Hypothesis testing at the extremes: fast and robust association for high-throughput data