Valid and Informative P-Values from Big Data, Illustrated in Environmental Epigenomics

Marie-Abele C. Bind, Donald B. Rubin
DOI: https://doi.org/10.1289/isee.2017.2017-963
2018-01-01
ISEE Conference Abstracts
Abstract: Background/Aim: A common issue in analyses of epigenomic data is the repeated use of statistical tests. Consider 17 people in a randomized experiment measuring the epigenomic effect of two exposure conditions (e.g., clean air and ozone) on DNA methylation assessed at 484,531 epigenome locations. The aim is to find the locations with an epigenetic effect of ozone versus clean air. Methods: We describe the use of randomization-based tests to obtain a Fisher exact p-value that is valid whatever the correlational structure of the data. The power of the resulting test to detect real differences, however, depends on the careful a priori selection of a single test statistic. We consider the generalized Elastic-Net regularization and choose the tuning parameters that minimize the Bayesian Information Criterion using a two-dimensional grid search. The two Elastic-Net penalties aim to shrink the "irrelevant" regression coefficients towards zero, and the method has been suggested to have the "Oracle" property (i.e., consistency in variable selection and asymptotic normality of the estimated non-zero coefficients). Note that many epigenomic studies stop at this step: after the regularization method is performed, they report the estimated "true" non-zero coefficients, implicitly assuming the "Oracle" property. The main innovation of our approach is to go one step further by using the non-zero coefficients to construct a test statistic and provide a randomization test-based p-value. Results: The Elastic-Net procedure selected 13 CpG sites, and the associated Fisher exact p-value was equal to 0.14. Here, we provide a non-parametric, non-asymptotic approach whose p-value suggests modest support for the Elastic-Net selection and modest evidence of causal effects of ozone on the epigenome. Conclusions: This procedure is compatible with any test statistic and generates valid and informative p-values.
To our knowledge, this is the first time that regularization methods have been coupled with Fisherian inference.
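The randomization-based p-value described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: a completely randomized two-arm assignment (the abstract does not specify the design), synthetic data, a Monte Carlo approximation in place of full enumeration of assignments, and a simple stand-in statistic (the largest absolute mean difference across sites) instead of the Elastic-Net-based statistic. The function names `randomization_p_value` and `max_abs_mean_difference` are hypothetical.

```python
import numpy as np


def randomization_p_value(outcomes, assignment, statistic, n_draws=2000, seed=0):
    """Monte Carlo approximation to a Fisher randomization p-value.

    Assumption: a completely randomized design, so the reference set is
    every reshuffling of `assignment` with the same number of treated units.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(outcomes, assignment)
    at_least_as_extreme = 0
    for _ in range(n_draws):
        # Draw a hypothetical assignment; outcomes stay fixed under the
        # sharp null hypothesis of no effect for any unit.
        shuffled = rng.permutation(assignment)
        if statistic(outcomes, shuffled) >= observed:
            at_least_as_extreme += 1
    # Count the observed assignment itself, so the p-value is never zero.
    return (at_least_as_extreme + 1) / (n_draws + 1)


def max_abs_mean_difference(outcomes, assignment):
    """Stand-in statistic (an assumption): largest absolute difference in
    site-wise means between the two exposure groups."""
    treated_mean = outcomes[assignment == 1].mean(axis=0)
    control_mean = outcomes[assignment == 0].mean(axis=0)
    return float(np.max(np.abs(treated_mean - control_mean)))


# Synthetic example: 17 units, 50 sites, no real effect built in.
data_rng = np.random.default_rng(1)
assignment = data_rng.permutation(np.array([1] * 8 + [0] * 9))
outcomes = data_rng.normal(size=(17, 50))
p_value = randomization_p_value(outcomes, assignment, max_abs_mean_difference)
print(f"randomization p-value: {p_value:.3f}")
```

Because the statistic is passed in as a function, any a priori chosen statistic could be substituted, including one that refits a regularized regression under each reshuffled assignment. A single p-value is obtained for the whole set of sites, whatever their correlation.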