Abstract:

Background: Before predictive models are implemented in novel settings, analyses of calibration and clinical usefulness are as important as discrimination, but they are discussed less often. Calibration is how well a model's predictions reflect actual outcome prevalence. Clinical usefulness refers to the utilities, costs, and harms of using a predictive model in practice. A decision analytic approach to calibrating models and selecting an optimal intervention threshold may help maximize the impact of readmission risk and other preventive interventions.

Objectives: To select a pragmatic means of calibrating predictive models that requires a minimum amount of validation data and performs well in practice, and to evaluate the impact of miscalibration on utility and cost via clinical usefulness analyses.

Materials and methods: Observational, retrospective cohort study using electronic health record data from 120,000 inpatient admissions at an urban, academic center in Manhattan. The primary outcome was thirty-day readmission for three causes: all-cause, congestive heart failure, and chronic coronary atherosclerotic disease. Predictive modeling was performed via L1-regularized logistic regression. The calibration methods compared included Platt Scaling, Logistic Calibration, and Prevalence Adjustment. Performance of predictive modeling and calibration was assessed via discrimination (c-statistic), calibration (Spiegelhalter Z-statistic, Root Mean Square Error [RMSE] of binned predictions, Sanders and Murphy Resolutions of the Brier Score, Calibration Slope and Intercept), and clinical usefulness (utility terms represented as costs). The amount of validation data necessary to apply each calibration algorithm was also assessed.

Results: C-statistics by diagnosis ranged from 0.7 for all-cause readmission to 0.86 (0.78-0.93) for congestive heart failure. Logistic Calibration and Platt Scaling performed best, and identifying this difference required analyzing multiple calibration metrics simultaneously, in particular Calibration Slopes and Intercepts. Clinical usefulness analyses provided optimal risk thresholds, which varied by reason for readmission, outcome prevalence, and calibration algorithm. Utility analyses also suggested maximum tolerable intervention costs, e.g., $1720 for all-cause readmissions based on a published cost of readmission of $11,862.

Conclusions: Choice of calibration method depends on the availability of validation data and on performance. Improperly calibrated models may contribute to higher costs of intervention as measured via clinical usefulness. Decision-makers must understand the underlying utilities or costs inherent in the use-case at hand to assess usefulness; doing so yields the optimal risk threshold to trigger intervention, along with intervention cost limits.
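
To make the calibration terms in the abstract concrete, the following is a minimal sketch, not the study's implementation. It assumes that logistic recalibration means regressing the observed outcome on the logit of the predicted risk, and that prevalence adjustment means shifting the predicted odds by the ratio of target to source prevalence odds; the study's exact definitions may differ. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def logit(p, eps=1e-12):
    # Clip to avoid infinities at exactly 0 or 1.
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_recalibration(p_val, y_val, n_iter=50):
    """Fit logit(p_cal) = a + b * logit(p) on validation data (Newton-Raphson).

    The fitted (a, b) can also be read as a calibration intercept and slope:
    a = 0 and b = 1 indicate predictions that need no correction.
    """
    X = np.column_stack([np.ones_like(p_val), logit(p_val)])
    beta = np.array([0.0, 1.0])  # start from "already calibrated"
    for _ in range(n_iter):
        mu = sigmoid(X @ beta)
        grad = X.T @ (y_val - mu)
        hess = X.T @ (X * (mu * (1 - mu))[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta[0], beta[1]

def recalibrate(p, a, b):
    # Apply the fitted intercept and slope on the log-odds scale.
    return sigmoid(a + b * logit(p))

def prevalence_adjust(p, source_prev, target_prev):
    # Assumed form: rescale odds by the target/source prevalence odds ratio.
    ratio = (target_prev / (1 - target_prev)) / (source_prev / (1 - source_prev))
    odds = p / (1 - p) * ratio
    return odds / (1 + odds)

# Synthetic illustration only: a deliberately miscalibrated score.
rng = np.random.default_rng(0)
true_risk = rng.beta(2, 8, size=20000)
y = rng.binomial(1, true_risk).astype(float)
p_raw = sigmoid(0.8 + 1.5 * logit(true_risk))  # shifted and overconfident

a, b = fit_logistic_recalibration(p_raw, y)
p_cal = recalibrate(p_raw, a, b)
a2, b2 = fit_logistic_recalibration(p_cal, y)  # slope/intercept after correction

print("raw  intercept, slope:", a, b)
print("post intercept, slope:", a2, b2)
print("mean predicted vs observed:", p_cal.mean(), y.mean())
```

In this sketch, a slope well below 1 on the raw predictions signals overconfident risks. After recalibration on the same data the slope and intercept return to approximately 1 and 0 by construction, which is why a held-out validation set is needed to assess calibration honestly.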

The harms of class imbalance corrections for machine learning based prediction models: a simulation study

The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression

Risk prediction models for discrete ordinal outcomes: calibration and the impact of the proportional odds assumption

Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

Implications of resampling data to address the class imbalance problem (IRCIP): an evaluation of impact on performance between classification algorithms in medical data

A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictions

Assessing the Impact of Case Correction Methods on the Fairness of COVID-19 Predictive Models

Causal Inference and Counterfactual Prediction in Machine Learning for Actionable Healthcare

Imbalanced class distribution and performance evaluation metrics: A systematic review of prediction accuracy for determining model performance in healthcare systems

Counterfactual Prediction Under Outcome Measurement Error

Addressing bias in prediction models by improving subpopulation calibration

On the variability of regression shrinkage methods for clinical prediction models: simulation study on predictive performance

Fair admission risk prediction with proportional multicalibration

Evaluating gender bias in ML-based clinical risk prediction models: A study on multiple use cases at different hospitals

Monitoring machine learning (ML)-based risk prediction algorithms in the presence of confounding medical interventions

When accurate prediction models yield harmful self-fulfilling prophecies

Calibration plots for multistate risk predictions models: an overview and simulation comparing novel approaches

Subpopulation-specific machine learning prognosis for underrepresented patients with double prioritized bias correction

Beyond discrimination: A comparison of calibration methods and clinical usefulness of predictive models of readmission risk

Calibration plots for multistate risk predictions models

Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control