High-Dimensional Randomized Crossover Studies: A Clarification of P-Values Interpretation
Marie-Abele Bind, Donald Rubin
DOI: https://doi.org/10.1289/isee.2015.2015-408
2015-01-01
Abstract: Epidemiological analyses often rely on p-values, whose classical interpretation has largely been lost. We reinforce the classical interpretation of p-values in randomized experiments, especially in settings with “big data” and consequently many tests. Fisher first introduced the concept of the p-value by tying it to the need for randomization of units to treatments. He proposed that researchers assess the sharp null hypothesis by conducting a randomization test summarized by a p-value. First, one chooses a suitable test statistic, S, and calculates its observed value, say S*. Then, one constructs the distribution of S induced by the randomization under the null hypothesis. To obtain this randomization distribution, one enumerates all possible treatment assignments (Ntotal) allowed by the assignment mechanism and, for each, calculates the value of S that would have been observed with that assignment. The proportion of these values of S, across the possible randomizations, that are as large as or larger than S* is the p-value. To illustrate, consider a randomized experiment with 20 participants and 3 treatments given to each in random order, so that there are six (=3*2*1) possible sequences of treatments. Whatever the test statistic, the minimum p-value that can be achieved equals 1/Ntotal (i.e., 1/(20*6)≈0.0083), which is achieved when S* is the largest of all possible values of S. Ntotal depends on the number of units and treatments. For example, Bonferroni adjustments (e.g., dividing the significance level by the number of tests), often used to “correct” for multiple testing in environmental studies with high-dimensional outcomes (e.g., methylation at 450,000 CpG sites), ignore the classical insight of randomization-based p-values, because they are applied to model-based p-values, which are not justified by the study design. Here, applying Bonferroni with thousands of tests would yield a nonsensical “corrected” significance level that is less than the minimum achievable p-value.
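
The randomization test described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code. It uses a simpler design than the crossover study (a hypothetical completely randomized experiment with 4 units, 2 of them treated), made-up outcome values, and a difference in means as the test statistic S. All function and variable names are assumptions. The final lines take the abstract's stated Ntotal of 20*6 at face value and compare 1/Ntotal with a Bonferroni level, assuming a conventional significance level of 0.05, which the abstract does not specify.

```python
from itertools import combinations


def mean_difference(outcomes, treated_indices):
    """Test statistic S: mean of treated units minus mean of control units."""
    treated_set = set(treated_indices)
    treated = [y for i, y in enumerate(outcomes) if i in treated_set]
    control = [y for i, y in enumerate(outcomes) if i not in treated_set]
    return sum(treated) / len(treated) - sum(control) / len(control)


def randomization_p_value(outcomes, observed_treated, statistic):
    """One-sided randomization p-value under the sharp null hypothesis.

    Under the sharp null, outcomes do not depend on the assignment, so they
    are held fixed while every possible assignment is enumerated.
    """
    s_observed = statistic(outcomes, tuple(observed_treated))
    # Assumed mechanism: every subset of the same size is equally likely.
    assignments = list(
        combinations(range(len(outcomes)), len(observed_treated))
    )
    n_total = len(assignments)
    # Count assignments whose S is as large as or larger than the observed S*.
    n_extreme = sum(
        1 for a in assignments if statistic(outcomes, a) >= s_observed
    )
    return n_extreme / n_total, n_total


# Hypothetical toy data: 4 units, with units 2 and 3 actually treated.
toy_outcomes = [1.0, 2.0, 5.0, 7.0]
p_value, n_total = randomization_p_value(toy_outcomes, (2, 3), mean_difference)
print(f"N_total = {n_total}, p-value = {p_value:.4f}")

# Figures from the abstract, taken at face value.
N_TOTAL_ABSTRACT = 20 * 6           # N_total as stated in the abstract
min_p_value = 1 / N_TOTAL_ABSTRACT  # smallest achievable p-value

ALPHA = 0.05       # assumed significance level (not given in the abstract)
N_TESTS = 450_000  # number of CpG sites mentioned in the abstract
bonferroni_level = ALPHA / N_TESTS

print(f"minimum p-value  = {min_p_value:.4f}")
print(f"Bonferroni level = {bonferroni_level:.2e}")
print("Bonferroni level below minimum p-value:", bonferroni_level < min_p_value)
```

In the toy data the observed assignment gives the largest possible value of S, so the p-value equals 1/N_total, the smallest value that design can produce. The last comparison shows the abstract's point: under these assumptions, the Bonferroni level falls below the minimum achievable randomization p-value, so no test could ever meet it.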