Abstract: Much empirical science involves evaluating alternative explanations for the obtained data. For example, given certain assumptions underlying a statistical test, a "significant" result generally refers to the implausibility of a null (zero) effect in the population producing the obtained study data. However, methodological work on various versions of p-hacking (i.e., using different analysis strategies until a "significant" result is produced) raises the concern that significant p-values might often reflect false findings. Indeed, initial simulations of single studies showed that the rate of "significant" but false findings might be much higher than the nominal .05 value when various analysis flexibilities are exploited. In many settings, however, research articles report multiple studies using consistent methods across the studies, and those consistent methods would constrain the flexibilities that produced high false-finding rates in simulations of single studies. Thus, we conducted simulations of study sets. These simulations show that consistent methods across studies (i.e., consistent in terms of which measures are analyzed, which conditions are included, and whether and how covariates are included) dramatically reduce the potential for flexible research practices (p-hacking) to produce consistent sets of significant results across studies. For p-hacking to produce even modest probabilities of a consistent set of studies would require (a) a large amount of selectivity in study reporting and (b) severe (and quite intentional) versions of p-hacking. With no more than modest selective reporting and with consistent methods across studies, p-hacking does not provide a plausible explanation for consistent empirical results across studies, especially as the size of the reported study set increases. In addition, the simulations show that p-hacking can produce high rates of false findings for single studies with very large samples. In contrast, a series of methodologically consistent studies (even with much smaller samples) is much less vulnerable to the forms of p-hacking examined in the simulations.
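
The contrast between single studies and study sets can be illustrated with a small simulation. The following is a minimal sketch, not the simulation design used in the paper: it assumes a two-group study with no true effect, several independent outcome measures, a z-test with known unit variance, and a single p-hacking strategy (choosing which measure to report). It also ignores the direction of effects. The function names and parameters (`z_test_p`, `null_p_values`, `simulate`, `n_measures`, `n_studies`) are hypothetical and chosen only for illustration.

```python
import math
import random


def z_test_p(x, y):
    """Two-sided p-value of a two-sample z-test, assuming unit variance."""
    n1, n2 = len(x), len(y)
    z = (sum(x) / n1 - sum(y) / n2) / math.sqrt(1 / n1 + 1 / n2)
    return math.erfc(abs(z) / math.sqrt(2))


def null_p_values(rng, n_per_group, n_measures):
    """p-values for each measure in one study where no true effect exists."""
    p_values = []
    for _ in range(n_measures):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        p_values.append(z_test_p(a, b))
    return p_values


def simulate(n_sims=5000, n_per_group=20, n_measures=3, n_studies=2,
             alpha=0.05, seed=0):
    """Estimate false-finding rates under three reporting scenarios.

    Returns (single, flexible_set, consistent_set):
      single         - one study, report any significant measure
      flexible_set   - every study has some significant measure,
                       possibly a different one in each study
      consistent_set - the SAME measure is significant in every study
    """
    rng = random.Random(seed)
    single = flexible_set = consistent_set = 0
    for _ in range(n_sims):
        studies = [null_p_values(rng, n_per_group, n_measures)
                   for _ in range(n_studies)]
        sig = [[p < alpha for p in ps] for ps in studies]
        single += any(sig[0])
        flexible_set += all(any(s) for s in sig)
        consistent_set += any(all(s[j] for s in sig)
                              for j in range(n_measures))
    return (single / n_sims, flexible_set / n_sims, consistent_set / n_sims)


if __name__ == "__main__":
    single, flexible_set, consistent_set = simulate()
    print(f"single study, any measure:       {single:.4f}")
    print(f"study set, any measure per study: {flexible_set:.4f}")
    print(f"study set, same measure in all:   {consistent_set:.4f}")
```

Under these simplifying assumptions the rates can also be computed directly: with three independent measures and alpha = .05, a single study yields at least one significant measure with probability 1 - .95^3 ≈ .143, two studies each yield some significant measure with probability ≈ .020, and the same measure is significant in both studies with probability 1 - (1 - .05^2)^3 ≈ .0075. Requiring a consistent direction of effect, which the sketch omits, would lower the last figure further.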

Screening $p$-Hackers: Dissemination Noise as Bait

Detecting p-hacking

p-Hacking, Data type and Data-Sharing Policy

Raiders of the Lost HARK: a Reproducible Inference Framework for Big Data Science

Mitigating Disinformation in Social Networks through Noise

X Hacking: The Threat of Misguided AutoML

Minimalism is King! High-Frequency Energy-based Screening for Data-Efficient Backdoor Attacks

Accumulating evidence across studies: Consistent methods protect against false findings produced by p-hacking

Multiple testing of composite null hypotheses for discrete data using randomized $p$-values

Generalization in the Face of Adaptivity: A Bayesian Perspective

A Simple Way to Deal with Cherry-picking

Disclosure risk assessment with Bayesian non-parametric hierarchical modelling

Model Weight Theft With Just Noise Inputs: The Curious Case of the Petulant Attacker

Beyond Uniform Reverse Sampling: A Hybrid Sampling Technique for Misinformation Prevention

How to Sift Out a Clean Data Subset in the Presence of Data Poisoning?

Privacy Guarantees in Posterior Sampling under Contamination

Attack-Aware Noise Calibration for Differential Privacy

P-hacking in Academic Research: Evidence from Experimental Accounting Studies

Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models

On the Privacy of Adaptive Cuckoo Filters: Analysis and Protection

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models