Abstract: The extraction of statistical results from scientific reports is beneficial for checking the plausibility and reliability of studies. The R package JATSdecoder supports the application of text mining approaches to scientific reports. Its function get.stats() extracts all reported statistical results from text and recomputes p values for most standard test results. The output can be reduced to results with checkable or computable p values only. In this article, the ability of get.stats() to extract, recompute and check statistical results is compared to that of statcheck, an already established tool. A manually coded data set, containing the number of statistically significant results in 49 articles, serves as an initial indicator of the differing detection rates of get.stats() and statcheck for statistical results. In addition, 13,531 PDF files from 10 major psychological journals, 18,744 XML documents from Frontiers in Psychology and 23,730 articles related to psychological research and published by PLoS One are scanned for statistical results with both algorithms. get.stats() almost replicates the manually extracted number of significant results in the 49 PDF articles. get.stats() outperforms the statcheck functions in identifying statistical results in every included journal and input format. Furthermore, the raw results extracted by get.stats() increase the detection rate of statcheck. JATSdecoder's function get.stats() is a highly general and reliable tool for extracting statistical results from text. It copes with a wide range of textual representations of standard statistical results and recomputes p values for two- and one-sided tests. It facilitates manual and automated checks on the consistency and completeness of the results reported within a manuscript.
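
The general idea of extracting a reported result, recomputing its p value and checking it against the reported one can be illustrated with a minimal Python sketch. This is not the implementation of get.stats() or statcheck: the pattern `Z_RESULT` and the functions `p_from_z` and `check_z_results` are hypothetical names introduced here, the sketch handles only z statistics written in the form "z = 2.10, p = .036", and it assumes that a reported p value is consistent when the recomputed value falls within its rounding interval.

```python
import math
import re

# Hypothetical pattern (an assumption of this sketch): matches results such as
# "z = 2.10, p = .036" or "Z = -1.96, p < 0.05".
Z_RESULT = re.compile(
    r"\bz\s*=\s*(?P<z>-?\d*\.?\d+)\s*,\s*p\s*(?P<op>[<=>])\s*(?P<p>\d*\.\d+)",
    re.IGNORECASE,
)


def p_from_z(z, two_sided=True):
    """Recompute the p value of a z statistic from the standard normal distribution."""
    upper_tail = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return 2 * upper_tail if two_sided else upper_tail


def check_z_results(text, two_sided=True):
    """Extract z-test results from text and compare reported with recomputed p values."""
    results = []
    for match in Z_RESULT.finditer(text):
        z = float(match.group("z"))
        op = match.group("op")
        reported = float(match.group("p"))
        recomputed = p_from_z(z, two_sided)
        if op == "=":
            # Consistent if the recomputed p lies within the rounding interval
            # implied by the number of reported decimals.
            decimals = len(match.group("p").split(".")[1])
            consistent = abs(recomputed - reported) <= 0.5 * 10 ** (-decimals)
        elif op == "<":
            consistent = recomputed < reported
        else:
            consistent = recomputed > reported
        results.append(
            {
                "z": z,
                "operator": op,
                "reported_p": reported,
                "recomputed_p": recomputed,
                "consistent": consistent,
            }
        )
    return results


example = (
    "The effect was significant, z = 2.10, p = .036. "
    "A second test, z = 1.20, p = .01, was reported too."
)
for result in check_z_results(example):
    print(result)
```

A tool of the kind described in the abstract must additionally cover other test statistics (for example t, F, chi-square and r), degrees of freedom, and many more textual variants than this single pattern.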

Appendix - Recommended Statistical Significance Tests for NLP Tasks

More accurate tests for the statistical significance of result differences

On Statistical Non-Significance

A Comprehensive Guide for Selecting Appropriate Statistical Tests: Understanding When to Use Parametric and Nonparametric Tests

Using score distributions to compare statistical significance tests for information retrieval evaluation

Inference at Scale Significance Testing for Large Search and Recommendation Experiments

Multiple testing in statistical analysis of systems-based information retrieval experiments

Evaluation of JATSdecoder as an automated text extraction tool for statistical results in scientific reports

Significance Tests of Feature Relevance for a Black-Box Learner

Significance testing without truth

Simple solution to a common statistical problem: Interpreting multiple tests

Confirming the statistically significant superiority of tree-based machine learning algorithms over their counterparts for tabular data

Evaluation metrics and statistical tests for machine learning

Statistical Test for Feature Selection Pipelines by Selective Inference

Structure matters: Assessing the statistical significance of network topologies

Statistical Dataset Evaluation: A Case Study on Named Entity Recognition

Assessing the statistical significance of association rules

Full Bayesian Significance Testing for Neural Networks

NLP-ADBench: NLP Anomaly Detection Benchmark

When More Is Less: Pitfalls of significance testing

The statistical advantage of automatic NLG metrics at the system level