Abstract:

Background: The adoption of predictive algorithms in health care comes with the potential for algorithmic bias, which could exacerbate existing disparities. Fairness metrics have been proposed to measure algorithmic bias, but their application to real-world tasks is limited.

Objective: This study aims to evaluate the algorithmic bias associated with the application of common 30-day hospital readmission models and assess the usefulness and interpretability of selected fairness metrics.

Methods: We used 10.6 million adult inpatient discharges from Maryland and Florida from 2016 to 2019 in this retrospective study. Three models predicting 30-day hospital readmissions were evaluated: the LACE Index, the modified HOSPITAL score, and the modified Centers for Medicare & Medicaid Services (CMS) readmission measure. Each was applied as-is (using existing coefficients) and retrained (recalibrated with 50% of the data). Predictive performance and bias measures were evaluated for the overall population, between Black and White populations, and between low-income and other-income groups. Bias measures included the parity of false negative rate (FNR), false positive rate (FPR), 0-1 loss, and generalized entropy index. Racial bias, represented by FNR and FPR differences, was stratified to explore shifts in algorithmic bias in different populations.

Results: The retrained CMS model demonstrated the best predictive performance (area under the curve: 0.74 in Maryland and 0.68-0.70 in Florida), and the modified HOSPITAL score demonstrated the best calibration (Brier score: 0.16-0.19 in Maryland and 0.19-0.21 in Florida). Calibration was better in White (compared to Black) populations and other-income (compared to low-income) groups, and the area under the curve was higher or similar in the Black (compared to White) populations. The retrained CMS model and the modified HOSPITAL score had the lowest racial and income bias in Maryland. In Florida, both of these models had the lowest income bias overall, and the modified HOSPITAL score showed the lowest racial bias. In both states, the White and higher-income populations showed a higher FNR, while the Black and low-income populations showed a higher FPR and a higher 0-1 loss. When stratified by hospital and population composition, these models demonstrated heterogeneous algorithmic bias in different contexts and populations.

Conclusions: Caution must be taken when interpreting fairness measures at face value. A higher FNR or FPR could potentially reflect missed opportunities or wasted resources, but these measures could also reflect health care use patterns and gaps in care. Simply relying on statistical notions of bias could obscure or underplay the causes of health disparity. Imperfect health data, analytic frameworks, and the underlying health systems must be carefully considered. Fairness measures can serve as a useful routine assessment to detect disparate model performance but are insufficient to inform mechanisms or policy changes. However, such an assessment is an important first step toward data-driven improvement to address existing health disparities.
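The bias measures named in the Methods (parity of FNR, FPR, and 0-1 loss between two groups, plus the generalized entropy index) can be illustrated with a small example. What follows is a minimal sketch, not the study's implementation. It assumes binary predictions taken at some fixed risk threshold, and a common formulation of the generalized entropy index with per-person benefit b_i = prediction_i - label_i + 1 and alpha = 2; the abstract does not say which threshold or formulation was used. The function names and the toy data are hypothetical.

```python
from collections import namedtuple

# Container for the three error-based measures compared between groups.
Rates = namedtuple("Rates", ["fnr", "fpr", "zero_one_loss"])


def group_rates(y_true, y_pred):
    """FNR, FPR, and 0-1 loss for binary labels and binary predictions."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fnr = fn / positives if positives else float("nan")
    fpr = fp / negatives if negatives else float("nan")
    return Rates(fnr, fpr, (fn + fp) / len(y_true))


def parity_gaps(y_true, y_pred, groups, group_a, group_b):
    """Difference (group_a minus group_b) in FNR, FPR, and 0-1 loss."""
    def subset(g):
        idx = [i for i, x in enumerate(groups) if x == g]
        return [y_true[i] for i in idx], [y_pred[i] for i in idx]

    a = group_rates(*subset(group_a))
    b = group_rates(*subset(group_b))
    return Rates(a.fnr - b.fnr, a.fpr - b.fpr,
                 a.zero_one_loss - b.zero_one_loss)


def generalized_entropy_index(y_true, y_pred, alpha=2):
    """Generalized entropy index over benefits b_i = pred_i - true_i + 1.

    Assumed formulation (alpha not in {0, 1}); undefined when every
    prediction is a false negative, because the mean benefit is then 0.
    """
    benefits = [p - t + 1 for t, p in zip(y_true, y_pred)]
    n = len(benefits)
    mu = sum(benefits) / n
    total = sum((b / mu) ** alpha - 1 for b in benefits)
    return total / (n * alpha * (alpha - 1))


# Synthetic toy data (not from the study): 1 = readmitted within 30 days.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["White"] * 4 + ["Black"] * 4

gaps = parity_gaps(y_true, y_pred, groups, "Black", "White")
print(gaps)
print(generalized_entropy_index(y_true, y_pred))
```

In this toy data both groups have the same 0-1 loss, yet the errors fall differently: one group has the higher FNR and the other the higher FPR. This is why the parity measures are read separately rather than collapsed into a single error rate.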

Evaluating the Fairness of the MIMIC-IV Dataset and a Baseline Algorithm: Application to the ICU Length of Stay Prediction

MIMIC-IF: Interpretability and Fairness Evaluation of Deep Learning Models on MIMIC-IV Dataset

Establishment of ICU Mortality Risk Prediction Models with Machine Learning Algorithm Using MIMIC-IV Database

Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML

Optimizing Mortality Prediction for ICU Heart Failure Patients: Leveraging XGBoost and Advanced Machine Learning with the MIMIC-III Database

Monitoring fairness in machine learning models that predict patient mortality in the ICU

Evaluating Algorithmic Bias in 30-Day Hospital Readmission Models: Retrospective Analysis

Fairness gaps in Machine learning models for hospitalization and emergency department visit risk prediction in home healthcare patients with heart failure

Explainable Machine Learning for ICU Readmission Prediction

Fairness in Computational Innovations: Identifying Bias in Substance Use Treatment Length of Stay Prediction Models with Policy Implications

A Comparative Study of Fairness in Medical Machine Learning

Using machine learning in prediction of ICU admission, mortality, and length of stay in the early stage of admission of COVID-19 patients

Can AI Help Reduce Disparities in General Medical and Mental Health Care?

Availability of information needed to evaluate algorithmic fairness - A systematic review of publicly accessible critical care databases

Machine learning for in-hospital mortality prediction in critically ill patients with acute heart failure: A retrospective analysis based on MIMIC-IV database

Criticality of Nursing Care for Patients With Alzheimer's Disease in the ICU: Insights From MIMIC III Dataset

Intersectional consequences for marginal fairness in prediction models of emergency admissions

Variation in model performance by data cleanliness and classification methods in the prediction of 30-day ICU mortality, a US nationwide retrospective cohort and simulation study

Predicting serious postoperative complications and evaluating racial fairness in machine learning algorithms for metabolic and bariatric surgery

Improvement of APACHE II score system for disease severity based on XGBoost algorithm