Abstract: A century ago, when Student's t-statistic was introduced, no one could have imagined how widely it would be applied in the modern era. It is now used in large-scale multiple hypothesis testing, feature selection and ranking, high-dimensional signal detection, and more. Student's t-statistic is constructed from the empirical distribution function (EDF). An alternative to the EDF is the kernel density estimate (KDE), which is a smoothed version of the EDF. The novelty of this work is an alternative to Student's t-test that uses the KDE technique, together with an exploration of the usefulness of the KDE-based t-test for large-scale simultaneous hypothesis testing. An optimal bandwidth parameter for the KDE approach is derived by minimizing the asymptotic error between the true p-value and its asymptotic estimate based on the normal approximation. When the KDE-based approach is used for large-scale simultaneous testing, a natural question is when the method fails to control the error rate. We show that the suggested KDE-based method can control the false discovery rate (FDR) if the total number of tests diverges at a smaller order of magnitude than N^{3/2}, where N is the total sample size. We compare our method with several possible alternatives with respect to FDR. In simulations, our method produces a lower proportion of false discoveries than its competitors; that is, it controls the FDR better. These empirical studies indicate that the proposed method can be successfully applied in practice. The usefulness of the proposed method is further illustrated through a gene expression data example.
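To make the pipeline described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-sample setting and a Gaussian kernel, for which the KDE has the same mean as the sample and a variance equal to the EDF variance plus h^2. The function names, the bandwidth argument `h`, and the use of the Benjamini-Hochberg step-up rule for FDR control are all illustrative assumptions; the abstract does not state the optimal bandwidth or the exact multiple-testing procedure.

```python
import math


def kde_t_statistic(x, h):
    """One-sample statistic built from KDE moments (assumed form).

    Assumes a Gaussian kernel with bandwidth h, so the KDE variance is
    the EDF variance plus h**2. With h = 0 this reduces to a t-type
    statistic using the 1/n variance estimate.
    """
    n = len(x)
    mean = sum(x) / n
    edf_var = sum((xi - mean) ** 2 for xi in x) / n
    return math.sqrt(n) * mean / math.sqrt(edf_var + h * h)


def normal_two_sided_p(t):
    """Two-sided p-value from the normal approximation: 2 * (1 - Phi(|t|))."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule; returns a rejection flag per test."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value falls under its threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


# Hypothetical usage: one row of observations per hypothesis.
samples = [[2.1, 1.9, 2.4, 2.2, 1.8], [0.1, -0.2, 0.3, -0.1, 0.0]]
pvals = [normal_two_sided_p(kde_t_statistic(row, h=0.1)) for row in samples]
flags = benjamini_hochberg(pvals, alpha=0.05)
```

In this sketch a positive bandwidth inflates the variance estimate, which shrinks the statistic toward zero and makes each individual test more conservative.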

Kernel Two-Sample Hypothesis Testing Using Kernel Set Classification

Generalized kernel two-sample tests

Exponentially Consistent Kernel Two-Sample Tests

A uniform kernel trick for high-dimensional two-sample problems

A Kernel Method for the Two-Sample Problem

Kernel Two-Sample Tests in High Dimension: Interplay Between Moment Discrepancy and Dimension-and-Sample Orders

On the Decreasing Power of Kernel and Distance Based Nonparametric Hypothesis Tests in High Dimensions

Learning Deep Kernels for Non-Parametric Two-Sample Tests

Hypothesis testing using pairwise distances and associated kernels (with Appendix)

Universal Hypothesis Testing with Kernels: Asymptotically Optimal Tests for Goodness of Fit

Kernel Two-Sample Tests for Manifold Data

Shared kernel Bayesian screening

Two-sample Testing Using Deep Learning

A two-sample test for high-dimensional data with applications to gene-set testing

Bayesian Optimal Two-sample Tests in High-dimension

Two-Sample Test Based on Classification Probability

Simulation-Based Hypothesis Testing of High Dimensional Means Under Covariance Heterogeneity

Two-Sample Test for Sparse High Dimensional Multinomial Distributions

Large-Scale Simultaneous Testing Using Kernel Density Estimation

A Kernel-Based Conditional Two-Sample Test Using Nearest Neighbors (with Applications to Calibration, Regression Curves, and Simulation-Based Inference)

Kernel Robust Hypothesis Testing