Abstract: Epidemiological analyses often rely on p-values whose classical interpretation has been lost. We reinforce the classical interpretation of p-values in randomized experiments, especially in settings with “big data” and consequently many tests. Fisher first introduced the concept of the p-value by tying it to the need for randomization of units to treatments. He proposed that researchers assess the sharp null hypothesis by conducting a randomization test summarized by a p-value. First, one chooses a suitable test statistic, S, and calculates its observed value, say S*. Then, one constructs the distribution of S induced by the randomization under the null hypothesis. To obtain such a randomization distribution, one enumerates all possible treatment assignments (Ntotal) based on the assignment mechanism, and for each, one calculates the value of S that would have been observed with that assignment. The proportion of such values of S across the possible randomizations that are as large as or larger than S* is the p-value. To illustrate, consider a randomized experiment with 20 participants and 3 treatments given to each in random order, so that there are six (= 3 × 2 × 1) possible sequences of treatments. Whatever the test statistic, the minimum p-value that can be achieved equals 1/Ntotal (i.e., 1/(20 × 6) ≈ 0.0083), which is achieved when S* is the largest of all possible values of S. Ntotal depends on the number of units and treatments. Bonferroni adjustments (e.g., dividing the significance level by the number of tests), often used to “correct” for multiple testing in environmental studies with high-dimensional outcomes (e.g., methylation on 450,000 CpG sites), ignore the classical insight of randomization-based p-values, because they are applied to model-based p-values, which are not justified by the study design. Here, applying Bonferroni with thousands of tests would yield a nonsensical “corrected” significance level less than the minimal achievable p-value.
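The randomization-test recipe described in the abstract can be illustrated with a minimal sketch. It does not reproduce the crossover design of the abstract: it assumes a simpler, completely randomized two-arm experiment with six units, hypothetical outcome values, and a difference in means as the test statistic S. The function name `randomization_p_value` is likewise hypothetical. The last lines only redo the abstract's arithmetic, taking its stated Ntotal of 20 × 6 as given.

```python
from itertools import combinations

def randomization_p_value(outcomes, treated):
    """One-sided randomization p-value under the sharp null of no effect.

    Assumes a completely randomized design in which len(treated) of the
    units receive treatment. Under the sharp null, every unit's outcome
    is the same whatever the assignment, so S can be recomputed for each
    possible assignment from the observed outcomes alone.
    """
    n = len(outcomes)
    k = len(treated)

    def stat(group):
        # Test statistic S: mean outcome of treated minus mean of controls.
        group = set(group)
        t = [outcomes[i] for i in range(n) if i in group]
        c = [outcomes[i] for i in range(n) if i not in group]
        return sum(t) / len(t) - sum(c) / len(c)

    s_obs = stat(treated)  # observed value S*
    # Enumerate every possible assignment and compute S for each one.
    values = [stat(g) for g in combinations(range(n), k)]
    n_total = len(values)
    # Proportion of assignments with S as large as or larger than S*
    # (small tolerance guards against floating-point ties).
    p = sum(v >= s_obs - 1e-12 for v in values) / n_total
    return p, n_total

# Hypothetical data: the three treated units have the three largest outcomes,
# so S* is the largest possible value and p reaches its minimum, 1/n_total.
p, n_total = randomization_p_value([1, 2, 3, 4, 5, 6], treated=(3, 4, 5))
print(p, n_total)  # 0.05 20

# Arithmetic from the abstract, taking its Ntotal = 20 * 6 as given.
min_p_abstract = 1 / (20 * 6)
bonferroni_level = 0.05 / 450_000  # 0.05 level split over 450,000 tests
print(bonferroni_level < min_p_abstract)  # True
```

In this toy design, no choice of test statistic can produce a p-value below 1/20, which is the point the abstract makes about thresholds set below the smallest achievable randomization p-value.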

High-Dimensional Randomized Crossover Studies: A Clarification of P-Values Interpretation
