Abstract: With the plethora of machine learning (ML) analyses published in the orthopaedic literature within the last 5 years, several attempts have been made to enhance our understanding of what exactly ML means and how it is used. At its most fundamental level, ML is a branch of artificial intelligence that uses algorithms to analyze and learn from patterns in data without explicit programming or human intervention. Traditional statistics, by contrast, requires a user to specifically choose variables of interest to create a model capable of predicting an outcome, the output of which (1) may be falsely influenced by the variables the user chose to include and (2) does not allow for optimization of performance. Early publications have served as succinct editorials or reviews intended to ease audiences unfamiliar with ML into the complexities that accompany the subject. Most commonly, these studies focus on the terminology and concepts surrounding ML because it is important to understand the rationale behind performing such studies. Unfortunately, these publications only touch on the most basic aspects of ML and are too frequently repetitive. Indeed, the conclusions of these articles reiterate that the potential clinical utility of these algorithms remains tangential at best in their current form and caution against premature adoption without external validation. As a result, our perspective and ability to draw our own conclusions from these studies have not advanced, and we are left concluding with each subsequent study that a new algorithm has been published for an outcome of interest but cannot be used until further validation. What readers now need is to return to the principles of the scientific method that they have used to critically assess vast numbers of publications before this wave of newly applied statistical methodology: a guide to interpreting results such that they can draw their own conclusions.
LEVEL OF EVIDENCE: Level V, expert opinion.
