Abstract: Many hospitals use early warning scores to help clinicians recognize potentially deteriorating patients and intervene early. Systematic reviews 1-3 have identified more than 30 such scores, which vary widely in the methods used for their development and validation. Edelson and colleagues 4 compared 6 early warning scores across more than 362 000 medical-surgical ward encounters in 7 hospitals in the Yale New Haven Health System. They compared 3 statistically advanced scores (eCART, the Rothman Index, and the Epic Deterioration Index) and 3 simpler, points-based scores (National Early Warning Score [NEWS], NEWS2, and Modified Early Warning Score [MEWS]) in their ability to predict ward-to–intensive care unit (ICU) transfer or death within 24 hours of the prediction. Accuracy and the amount of lead time between a high-risk prediction and a deterioration event varied across the scores. In some cases, the simpler scores outperformed more statistically advanced scores. The best performing score was eCART, whereas the Epic Deterioration Index was among the worst performing scores. Despite their widespread use, the evidence base for early warning scores remains surprisingly thin. Many scores have serious methodological flaws or have not been externally validated, and relatively few scores are shared openly. 1 There have been few rigorous evaluations of clinical impact, with only a small number of studies showing improved patient outcomes. 3 Thus, despite their promise, there is still substantial uncertainty about which early warning scores should be used and how they should be implemented. The comparative performance of early warning scores is poorly understood because of heterogeneity in the datasets and methods used to develop and validate each score. By benchmarking the performance of early warning scores in a large, multicenter, external dataset, Edelson and colleagues 4 make an important contribution to the literature.
Although they compared several commonly used scores, it is unfortunate that many other models are not shared openly and could not also be compared, with the most obvious omission being the Advance Alert Monitor, which was implemented to reduce 30-day mortality in 21 Kaiser Permanente Northern California hospitals. 5 The study's findings somewhat contradict the previous literature. Systematic reviews have found that statistically advanced models, including those that use machine learning, tend to outperform simpler, points-based scores. 2 However, such studies are often conducted in the datasets that are used to train the advanced models and thus may produce optimistic estimates of model performance. In this direct comparison in an external dataset, the statistically advanced scores were not uniformly better than simpler ones. The eCART score was superior across various comparisons, but the simple NEWS and NEWS2 scores performed similarly to the Rothman Index and were better than the Epic Deterioration Index. It is worth noting that eCART was the only model in this study that was based on machine learning. It is a gradient-boosted machine learning model with 97 predictors. In contrast, the Epic Deterioration Index is an ordinal logistic regression model with 17 predictors, and the Rothman Index is a heuristic model that aggregates mortality risk associated with 26 individual variables using advanced statistics but not machine learning. The simpler NEWS and NEWS2 models are also based on logistic regression (with 7 input variables), and the worst-performing model, MEWS, was based on expert consensus and 5 inputs. Although the study's authors 4 describe only the first 3 models as artificial intelligence (AI), it is not clear where this boundary should be drawn or whether this distinction is useful.
To understand the performance of a prediction model, it is more helpful to take a holistic view of its development, which includes the statistical model but also considers other factors, such as the size, quality, and diversity of training data; the definition of outcomes; and the selection of predictors. To this end, it is important that the development of prediction models is reported transparently and completely, even when they are proprietary. Edelson and colleagues 4 demonstrate that well-developed machine learning models, like eCART, can be more accurate than simpler alternatives. But advanced models can also fail to generalize well to external datasets. One school of thought suggests that we should not try to develop AI models for widespread out-of-the-box use (ie, without modification). Instead, AI models could be retrained in each local environment to optimize performance. Although pretrained foundation models 6 may make this easier in the future, this strategy is not practical in the near term, as the expertise and computational resou -Abstract Truncated-
