Abstract: Many hospitals use early warning scores to help clinicians recognize potentially deteriorating patients and intervene early. Systematic reviews 1-3 have identified more than 30 such scores, which vary widely in the methods used for their development and validation. Edelson and colleagues 4 compared 6 early warning scores across more than 362 000 medical-surgical ward encounters in 7 hospitals in the Yale New Haven Health System. They compared 3 statistically advanced scores (eCART, the Rothman Index, and the Epic Deterioration Index) and 3 simpler, points-based scores (National Early Warning Score [NEWS], NEWS2, and Modified Early Warning Score [MEWS]) in their ability to predict ward-to–intensive care unit (ICU) transfer or death within 24 hours of the prediction. Accuracy and the amount of lead time between a high-risk prediction and a deterioration event varied across the scores. In some cases, the simpler scores outperformed more statistically advanced scores. The best-performing score was eCART, whereas the Epic Deterioration Index was among the worst-performing scores.

Despite their widespread use, the evidence base for early warning scores remains surprisingly thin. Many scores have serious methodological flaws or have not been externally validated, and relatively few scores are shared openly. 1 There have been few rigorous evaluations of clinical impact, with only a small number of studies showing improved patient outcomes. 3 Thus, despite their promise, there is still substantial uncertainty about which early warning scores should be used and how they should be implemented.

The comparative performance of early warning scores is poorly understood because of heterogeneity in the datasets and methods used to develop and validate each score. By benchmarking the performance of early warning scores in a large, multicenter, external dataset, Edelson and colleagues 4 make an important contribution to the literature.
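The lead time between a high-risk prediction and a deterioration event, mentioned above, can be made concrete with a small sketch. The definition used here (time from the first score at or above a threshold to the event) is an assumption for illustration; the text above does not state how the study defined it. The function name `lead_time_hours`, the thresholds, and the timestamps are all hypothetical and are not taken from any of the scores discussed.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple

# (time the score was computed, score value); a hypothetical record layout
Observation = Tuple[datetime, float]


def lead_time_hours(
    observations: Sequence[Observation],
    threshold: float,
    event_time: datetime,
) -> Optional[float]:
    """Hours from the first score at or above `threshold` to the event.

    Returns None when no score reaches the threshold before the event.
    """
    for when, value in sorted(observations):
        if when >= event_time:
            break  # scores at or after the event give no advance warning
        if value >= threshold:
            return (event_time - when).total_seconds() / 3600.0
    return None


# Hypothetical encounter: a score every 4 hours, deterioration at 02:00 the next day.
observations = [
    (datetime(2024, 1, 1, 8, 0), 2.0),
    (datetime(2024, 1, 1, 12, 0), 4.0),
    (datetime(2024, 1, 1, 16, 0), 7.0),
    (datetime(2024, 1, 1, 20, 0), 9.0),
]
event_time = datetime(2024, 1, 2, 2, 0)

print(lead_time_hours(observations, threshold=7.0, event_time=event_time))
print(lead_time_hours(observations, threshold=10.0, event_time=event_time))
```

With these made-up values, a threshold of 7 is first reached at 16:00, giving 10 hours of lead time, while a threshold of 10 is never reached and the function returns `None`. Lowering the threshold lengthens lead time but would also flag more patients who never deteriorate, which is why lead time and accuracy need to be read together.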
Although they compared several commonly used scores, it is unfortunate that many other models are not shared openly and could not also be compared, with the most obvious omission being the Advanced Alert Monitor, which was implemented to reduce 30-day mortality in 21 Kaiser Permanente Northern California hospitals. 5

The study's findings somewhat contradict the previous literature. Systematic reviews have found that statistically advanced models, including those that use machine learning, tend to outperform simpler, points-based scores. 2 However, such studies are often conducted in the datasets that are used to train the advanced models and thus may produce optimistic estimates of model performance. In this direct comparison in an external dataset, the statistically advanced scores were not uniformly better than simpler ones. The eCART score was superior across various comparisons, but the simple NEWS and NEWS2 scores performed similarly to the Rothman Index and were better than the Epic Deterioration Index.

It is worth noting that eCART was the only model in this study that was based on machine learning. It is a gradient-boosted machine learning model with 97 predictors. In contrast, the Epic Deterioration Index is an ordinal logistic regression model with 17 predictors, and the Rothman Index is a heuristic model that aggregates mortality risk associated with 26 individual variables using advanced statistics but not machine learning. The simpler NEWS and NEWS2 models are also based on logistic regression (with 7 input variables), and the worst-performing model, MEWS, was based on expert consensus and 5 inputs. Although the study's authors 4 describe only the first 3 models as artificial intelligence (AI), it is not clear where this boundary should be drawn or whether this distinction is useful.
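A direct comparison of the kind described above amounts to scoring the same encounters with each model and computing the same metric on the same outcomes. The sketch below uses the area under the ROC curve as that metric; this is an assumption, since the text above refers only to "accuracy". The data are invented toy values, and `points_score` and `model_risk` are hypothetical stand-ins, not outputs of NEWS, eCART, or any other named score.

```python
from typing import Sequence


def auroc(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen encounter with the
    event scores higher than one without it; ties count as one half.
    """
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


# Toy external dataset: 1 = ICU transfer or death within 24 hours (made up).
outcomes = [0, 0, 0, 0, 1, 1, 0, 1]
points_score = [1, 2, 2, 3, 3, 5, 4, 6]  # hypothetical integer score
model_risk = [0.05, 0.10, 0.20, 0.15, 0.40, 0.70, 0.30, 0.90]  # hypothetical

for name, values in [("points_score", points_score), ("model_risk", model_risk)]:
    print(f"{name}: AUROC = {auroc(values, outcomes):.2f}")
```

Because both scores are evaluated on encounters that neither was developed on, the comparison avoids the optimism that arises when an advanced model is tested in its own training data. The quadratic pairwise loop is fine for a toy example; a dataset of several hundred thousand encounters would call for a rank-based implementation.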
To understand the performance of a prediction model, it is more helpful to take a holistic view of its development, which includes the statistical model but also considers other factors, such as the size, quality, and diversity of training data; the definition of outcomes; and the selection of predictors. To this end, it is important that the development of prediction models is reported transparently and completely, even when they are proprietary.

Edelson and colleagues 4 demonstrate that well-developed machine learning models, like eCART, can be more accurate than simpler alternatives. But advanced models can also fail to generalize well to external datasets. One school of thought suggests that we should not try to develop AI models for widespread out-of-the-box use (ie, without modification). Instead, AI models could be retrained in each local environment to optimize performance. Although pretrained foundation models 6 may make this easier in the future, this strategy is not practical in the near term, as the expertise and computational resou -Abstract Truncated-

Toward the Rigorous Evaluation of Early Warning Scores
