Abstract: $P$-values derived from continuously distributed test statistics are typically uniformly distributed on $(0,1)$ under least favorable parameter configurations (LFCs) in the null hypothesis. Conservativeness of a $p$-value $P$ (meaning that, under the null hypothesis, $P$ is stochastically larger than a random variable uniformly distributed on $(0,1)$) can occur if the test statistic from which $P$ is derived is discrete, or if the true parameter value under the null is not an LFC. To deal with both sources of conservativeness, we present two approaches utilizing randomized $p$-values, namely single-stage and two-stage randomization. We illustrate their effectiveness for testing a composite null hypothesis under a binomial model. We also give an example of how the proposed $p$-values can be used to test a composite null in group testing designs. Consistent with previous findings, the proposed randomized $p$-values are less conservative than non-randomized $p$-values under the null hypothesis, but they are stochastically not smaller under the alternative. Establishing the validity of randomized $p$-values is not trivial and has received attention in previous literature. We show that our proposed randomized $p$-values are valid under various discrete statistical models in which the distribution of the corresponding test statistic belongs to an exponential family. We also investigate the behaviour of the power function of the tests based on the proposed randomized $p$-values as a function of the sample size. Simulations and a real data analysis are used to compare the different $p$-values considered.
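The two sources of conservativeness described above can be made concrete with a small sketch. It uses the classical textbook randomized $p$-value for a one-sided binomial test of $H_0: \theta \le \theta_0$, which adds an independent uniform variate $U$ to remove the effect of discreteness. This is an assumption chosen for illustration and is not necessarily the single-stage or two-stage construction proposed in the paper; all function names and the example values of `n`, `theta0` and `theta` are hypothetical.

```python
from math import comb


def binom_pmf(k, n, theta):
    """Probability mass P(X = k) for X ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def nonrandomized_p(x, n, theta0):
    """Classical p-value P_{theta0}(X >= x), evaluated at the LFC theta0."""
    return sum(binom_pmf(k, n, theta0) for k in range(x, n + 1))


def randomized_p(x, n, theta0, u):
    """Textbook randomized p-value: P_{theta0}(X > x) + u * P_{theta0}(X = x).

    u is a realization of U ~ Uniform(0, 1), drawn independently of the data.
    """
    upper = sum(binom_pmf(k, n, theta0) for k in range(x + 1, n + 1))
    return upper + u * binom_pmf(x, n, theta0)


def nonrandomized_cdf(t, n, theta0, theta):
    """Exact P_theta(nonrandomized p-value <= t)."""
    return sum(
        binom_pmf(x, n, theta)
        for x in range(n + 1)
        if nonrandomized_p(x, n, theta0) <= t
    )


def randomized_cdf(t, n, theta0, theta):
    """Exact P_theta(randomized p-value <= t), integrating out U analytically.

    Assumes 0 < theta0 < 1 so that every point mass under theta0 is positive.
    """
    total = 0.0
    for x in range(n + 1):
        upper = sum(binom_pmf(k, n, theta0) for k in range(x + 1, n + 1))
        mass = binom_pmf(x, n, theta0)
        # Fraction of U-values for which upper + U * mass <= t.
        frac = min(1.0, max(0.0, (t - upper) / mass))
        total += binom_pmf(x, n, theta) * frac
    return total
```

In this sketch, with `n = 10`, `theta0 = 0.5` and `t = 0.05`, `randomized_cdf` evaluated at `theta = theta0` returns `t` up to floating-point error (exact uniformity at the LFC), whereas `nonrandomized_cdf` returns a smaller value because of discreteness. Evaluating `randomized_cdf` at a null parameter such as `theta = 0.3` still gives a value below `t`, which is the second source of conservativeness (the true parameter is not an LFC) that this simple construction does not address.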

Early Stopping Based on Repeated Significance

Continuous Monitoring via Repeated Significance

Statistical properties of an early stopping rule for resampling-based multiple testing

Simple solution to a common statistical problem: Interpreting multiple tests

Measuring the robustness of predictive probability for early stopping in two-group comparisons

Some results on signal detection with one-sided stopping and deadline

Post-hoc Hypothesis Testing

Multiple testing of composite null hypotheses for discrete data using randomized $p$-values

On limiting behaviors of stepwise multiple testing procedures

On stepdown control of the false discovery proportion

A New Multiple Testing Method in the Dependent Case

Addressing researcher degrees of freedom through minP adjustment

Measuring the Robustness of Predictive Probability for Early Stopping in Experimental Design

Multiple testing using uniform filtering of ordered p-values

Theoretical Justification of the Bi Error Method

Exact and Approximate Stepdown Methods for Multiple Hypothesis Testing

On Asymptotic Behaviors of Stepwise Multiple Testing Procedures

Heteroscedasticity-Adjusted Ranking and Thresholding for Large-Scale Multiple Testing

Robustness of multiple testing procedures against dependence

Multiple testing when many $p$-values are uniformly conservative, with application to testing qualitative interaction in educational interventions

Bagged Empirical Null p-values: A Method to Account for Model Uncertainty in Large Scale Inference