Toward the Rigorous Evaluation of Early Warning Scores
Amol A. Verma
DOI: https://doi.org/10.1001/jamanetworkopen.2024.38966
2024-10-16
JAMA Network Open
Abstract: Many hospitals use early warning scores to help clinicians recognize potentially deteriorating patients and intervene early. Systematic reviews 1-3 have identified more than 30 such scores, which vary widely in the methods used for their development and validation. Edelson and colleagues 4 compared 6 early warning scores across more than 362 000 medical-surgical ward encounters in 7 hospitals in the Yale New Haven Health System. They compared 3 statistically advanced scores (eCART, the Rothman Index, and the Epic Deterioration Index) and 3 simpler, points-based scores (National Early Warning Score [NEWS], NEWS2, and Modified Early Warning Score [MEWS]) in their ability to predict ward-to–intensive care unit (ICU) transfer or death within 24 hours of the prediction. Accuracy and the amount of lead time between a high-risk prediction and a deterioration event varied across the scores. In some cases, the simpler scores outperformed more statistically advanced scores. The best performing score was eCART, whereas the Epic Deterioration Index was among the worst performing scores. Despite their widespread use, the evidence base for early warning scores remains surprisingly thin. Many scores have serious methodological flaws or have not been externally validated, and relatively few scores are shared openly. 1 There have been few rigorous evaluations of clinical impact, with only a small number of studies showing improved patient outcomes. 3 Thus, despite their promise, there is still substantial uncertainty about which early warning scores should be used and how they should be implemented. The comparative performance of early warning scores is poorly understood because of heterogeneity in the datasets and methods used to develop and validate each score. By benchmarking the performance of early warning scores in a large, multicenter, external dataset, Edelson and colleagues 4 make an important contribution to the literature.
Although they compared several commonly used scores, it is unfortunate that many other models are not shared openly and could not also be compared, with the most obvious omission being the Advanced Alert Monitor, which was implemented to reduce 30-day mortality in 21 Kaiser Permanente Northern California hospitals. 5 The study's findings somewhat contradict the previous literature. Systematic reviews have found that statistically advanced models, including those that use machine learning, tend to outperform simpler, points-based scores. 2 However, such studies are often conducted in the datasets that are used to train the advanced models and thus may produce optimistic estimates of model performance. In this direct comparison in an external dataset, the statistically advanced scores were not uniformly better than simpler ones. The eCART score was superior across various comparisons, but the simple NEWS and NEWS2 scores performed similarly to the Rothman Index and were better than the Epic Deterioration Index. It is worth noting that eCART was the only model in this study that was based on machine learning. It is a gradient-boosted machine learning model with 97 predictors. In contrast, the Epic Deterioration Index is an ordinal logistic regression model with 17 predictors, and the Rothman Index is a heuristic model that aggregates mortality risk associated with 26 individual variables using advanced statistics but not machine learning. The simpler NEWS and NEWS2 models are also based on logistic regression (with 7 input variables), and the worst-performing model, MEWS, was based on expert consensus and 5 inputs. Although the study's authors 4 describe only the first 3 models as artificial intelligence (AI), it is not clear where this boundary should be drawn or whether this distinction is useful. 
To understand the performance of a prediction model, it is more helpful to take a holistic view of its development, which includes the statistical model but also considers other factors, such as the size, quality, and diversity of training data; the definition of outcomes; and the selection of predictors. To this end, it is important that the development of prediction models is reported transparently and completely, even when they are proprietary. Edelson and colleagues 4 demonstrate that well-developed machine learning models, like eCART, can be more accurate than simpler alternatives. But advanced models can also fail to generalize well to external datasets. One school of thought suggests that we should not try to develop AI models for widespread out-of-the-box use (ie, without modification). Instead, AI models could be retrained in each local environment to optimize performance. Although pretrained foundation models 6 may make this easier in the future, this strategy is not practical in the near term, as the expertise and computational resou -Abstract Truncated-
medicine, general & internal