Abstract: Researchers often misinterpret and misrepresent statistical outputs. This abuse has led to a large literature on modification or replacement of testing thresholds and $P$-values with confidence intervals, Bayes factors, and other devices. Because the core problems appear cognitive rather than statistical, we review simple aids to statistical interpretations. These aids emphasize logical and information concepts over probability, and thus may be more robust to common misinterpretations than are traditional descriptions. We use the Shannon transform of the $P$-value $p$, also known as the binary surprisal or $S$-value $s=-\log_{2}(p)$, to measure the information supplied by the testing procedure, and to help calibrate intuitions against simple physical experiments like coin tossing. We also use tables or graphs of test statistics for alternative hypotheses, and interval estimates for different percentile levels, to thwart fallacies arising from arbitrary dichotomies. Finally, we reinterpret $P$-values and interval estimates in unconditional terms, which describe compatibility of data with the entire set of analysis assumptions. We illustrate these methods with a reanalysis of data from an existing record-based cohort study. In line with other recent recommendations, we advise that teaching materials and research reports discuss $P$-values as measures of compatibility rather than significance, compute $P$-values for alternative hypotheses whenever they are computed for null hypotheses, and interpret interval estimates as showing values of high compatibility with data, rather than regions of confidence. Our recommendations emphasize cognitive devices for displaying the compatibility of the observed data with various hypotheses of interest, rather than focusing on single hypothesis tests or interval estimates. We believe these simple reforms are well worth the minor effort they require.
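As a minimal sketch of the $S$-value transform $s=-\log_{2}(p)$ described in the abstract, the Python snippet below converts a $P$-value into bits of surprisal and rounds it to a whole number of fair coin tosses for comparison. The function names `s_value` and `equivalent_heads` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def s_value(p: float) -> float:
    """Return the binary surprisal (S-value) in bits: s = -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    return -math.log2(p)


def equivalent_heads(p: float) -> int:
    """Round the S-value to the nearest whole number of fair coin tosses.

    Seeing k heads in k fair tosses has probability 2**-k, which
    corresponds to exactly k bits of surprisal. (Illustrative helper,
    not part of the source.)
    """
    return round(s_value(p))


if __name__ == "__main__":
    for p in (0.25, 0.05, 0.005):
        print(f"p = {p}: s = {s_value(p):.2f} bits, "
              f"roughly {equivalent_heads(p)} heads in a row")
```

Under this transform, $p=0.05$ corresponds to about 4.3 bits, which is only slightly more surprising than seeing four heads in four tosses of a fair coin.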

More accurate tests for the statistical significance of result differences

EBT: a Statistic Test Identifying Moderate Size of Significant Features with Balanced Power and Precision for Genome-Wide Rate Comparisons

Using score distributions to compare statistical significance tests for information retrieval evaluation

Comparing Accuracy Assessments to Infer Superiority of Image Classification Methods

Appendix - Recommended Statistical Significance Tests for NLP Tasks

Inference at Scale Significance Testing for Large Search and Recommendation Experiments

About statistical significance, and the lack thereof

Simple solution to a common statistical problem: Interpreting multiple tests

Multiple testing in statistical analysis of systems-based information retrieval experiments

Testing the Consistency of Performance Scores Reported for Binary Classification Problems

Consistency of invariance-based randomization tests

Testing practical relevance of treatment effects

On Statistical Non-Significance

Semantic and Cognitive Tools to Aid Statistical Science: Replace Confidence and Significance by Compatibility and Surprise

Another look at the Lady Tasting Tea and differences between permutation tests and randomization tests

Measures, Uncertainties, and Significance Test in Operational ROC Analysis

Significance testing without truth

Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation

The Limits of Assumption-free Tests for Algorithm Performance

Continuous Monitoring via Repeated Significance

Missing at Random or Not: A Semiparametric Testing Approach