Abstract:

Background: The replication crisis hit the medical sciences about a decade ago, yet most of the flaws inherent in null hypothesis significance testing (NHST) remain unsolved. While the drawbacks of p-values have been detailed in countless venues, only a few attractive alternatives to p-values and NHST have been proposed for clinical research. Bayesian methods are one of them, and they are gaining increasing attention in medical research: in contrast to the frequentist framework, they describe model parameters in terms of probability and allow prior information to be incorporated. While Bayesian methods are not the only remedy, there is growing agreement that they are an essential way to avoid common misconceptions and false interpretations of study results. The requirements for applying Bayesian statistics have shifted from detailed programming knowledge to simple point-and-click programs like JASP. Still, the multitude of Bayesian significance and effect measures, which contrast with the p-value, the gold standard of significance in medical research, causes a lack of agreement on which measure to report.

Methods: Therefore, in this paper, we conduct an extensive simulation study to compare common Bayesian significance and effect measures that can be obtained from a posterior distribution. We analyse the behaviour of these measures for one of the most important statistical procedures in medical research, and in clinical trials in particular: the two-sample Student’s (and Welch’s) t-test.

Results: The results show that some measures cannot state evidence for both the null and the alternative hypothesis. While the different indices behave similarly with respect to increasing sample size and noise, the prior modelling influences the results, and extreme priors allow for cherry-picking similar to p-hacking in the frequentist paradigm. The indices behave quite differently in their ability to control the type I error rate and to detect an existing effect.

Conclusion: Based on the results, two of the commonly used indices can be recommended for more widespread use in clinical and biomedical research, as they improve type I error control compared to the classic two-sample t-test and have multiple other desirable properties.
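To make the notions of type I error control and effect detection concrete, the following is a minimal sketch of a Monte Carlo check for the frequentist baseline only, Welch's two-sample t-test. It is not the simulation design used in the paper and does not compute any of the Bayesian indices. The function name `welch_rejection_rate`, the normal data model, and all default values (group size, effect, number of simulated trials, seed) are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def welch_rejection_rate(n_per_group, effect=0.0, sd_a=1.0, sd_b=1.0,
                         alpha=0.05, n_sims=5000, seed=1):
    """Share of simulated two-group trials in which Welch's t-test gives p < alpha.

    Hypothetical helper for illustration: with effect=0.0 the returned share
    estimates the type I error rate; with a nonzero effect it estimates power.
    """
    rng = np.random.default_rng(seed)
    # One row per simulated trial, one column per observation (assumed normal data).
    group_a = rng.normal(0.0, sd_a, size=(n_sims, n_per_group))
    group_b = rng.normal(effect, sd_b, size=(n_sims, n_per_group))
    # equal_var=False selects Welch's test instead of Student's pooled-variance test.
    result = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)
    return float(np.mean(result.pvalue < alpha))


if __name__ == "__main__":
    print("estimated type I error rate:", welch_rejection_rate(30, effect=0.0))
    print("estimated power at a mean difference of 0.8 SD:",
          welch_rejection_rate(30, effect=0.8))
```

Under the null (`effect=0.0`) the estimated rejection rate should lie close to the nominal `alpha`. A comparison along the lines of the abstract would replace the p-value rule with a decision rule based on a posterior distribution and evaluate it on the same simulated data.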

Simulation methodologies to determine statistical power in laboratory animal research studies

Determining sample size adequacy for animal model studies in nutrition research: limits and ethical challenges of ordinary power calculation procedures

Using simulation studies to evaluate statistical methods

Effect size, sample size and power of forced swim test assays in mice: Guidelines for investigators to optimize reproducibility

Understanding p-values and significance

Enhancing Statistical Power While Maintaining Small Sample Sizes in Behavioral Neuroscience Experiments Evaluating Success Rates

Statistical simulations show that scientists need not increase overall sample size by default when including both sexes in in vivo studies

Recommendations to improve use and reporting of statistics in animal experiments

How much confidence do we need in animal experiments? Statistical assumptions in sample size estimation

Bayesian statistical concepts with examples from rodent toxicology studies

Simulation-Based Power Analyses for the Smallest Effect Size of Interest: A Confidence-Interval Approach for Minimum-Effect and Equivalence Testing

Finding the right power balance: Better study design and collaboration can reduce dependence on statistical power

Sample size determination and post hoc statistical power in bioequivalence studies

Pitfalls and potentials in simulation studies: Questionable research practices in comparative simulation studies allow for spurious claims of superiority of any method

PDXpower: A Power Analysis Tool for Experimental Design in Pre-clinical Xenograft Studies for Uncensored and Censored Outcomes

Considering aspects of the 3Rs principles within experimental animal biology

Study design: think 'scientific value' not 'p-values'

Analysis of Bayesian posterior significance and effect size indices for the two-sample t-test to support reproducible medical research

Power Analysis Software for Educational Researchers

Data simulations for advancing psychological research: Insights, preparations and investigations

Simulation-Based Power-Analysis for Factorial ANOVA Designs