Abstract:

Background: The adoption of predictive algorithms in health care comes with the potential for algorithmic bias, which could exacerbate existing disparities. Fairness metrics have been proposed to measure algorithmic bias, but their application to real-world tasks is limited.

Objective: This study aims to evaluate the algorithmic bias associated with the application of common 30-day hospital readmission models and assess the usefulness and interpretability of selected fairness metrics.

Methods: We used 10.6 million adult inpatient discharges from Maryland and Florida from 2016 to 2019 in this retrospective study. Three models predicting 30-day hospital readmissions were evaluated: the LACE Index, the modified HOSPITAL score, and the modified Centers for Medicare & Medicaid Services (CMS) readmission measure. Each was applied as-is (using existing coefficients) and retrained (recalibrated with 50% of the data). Predictive performance and bias measures were evaluated overall, between Black and White populations, and between low- and other-income groups. Bias measures included the parity of false negative rate (FNR), false positive rate (FPR), 0-1 loss, and generalized entropy index. Racial bias, represented by FNR and FPR differences, was stratified to explore shifts in algorithmic bias in different populations.

Results: The retrained CMS model demonstrated the best predictive performance (area under the curve: 0.74 in Maryland and 0.68-0.70 in Florida), and the modified HOSPITAL score demonstrated the best calibration (Brier score: 0.16-0.19 in Maryland and 0.19-0.21 in Florida). Calibration was better in White (compared to Black) populations and other-income (compared to low-income) groups, and the area under the curve was higher or similar in the Black (compared to White) populations. The retrained CMS and modified HOSPITAL score had the lowest racial and income bias in Maryland. In Florida, both of these models overall had the lowest income bias and the modified HOSPITAL score showed the lowest racial bias. In both states, the White and higher-income populations showed a higher FNR, while the Black and low-income populations showed a higher FPR and a higher 0-1 loss. When stratified by hospital and population composition, these models demonstrated heterogeneous algorithmic bias in different contexts and populations.

Conclusions: Caution must be taken when interpreting fairness measures at face value. A higher FNR or FPR could potentially reflect missed opportunities or wasted resources, but these measures could also reflect health care use patterns and gaps in care. Simply relying on the statistical notions of bias could obscure or underplay the causes of health disparity. The imperfect health data, analytic frameworks, and the underlying health systems must be carefully considered. Fairness measures can serve as a useful routine assessment to detect disparate model performances but are insufficient to inform mechanisms or policy changes. However, such an assessment is an important first step toward data-driven improvement to address existing health disparities.
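
As an illustration of the bias measures named in the abstract, the following is a minimal sketch of how group-wise FNR, FPR, and 0-1 loss, their between-group gaps, and a generalized entropy index might be computed. It is not the authors' code. The toy labels, the group names "A" and "B", the function names, the benefit definition b = prediction - label + 1, and alpha = 2 are all assumptions made for illustration; the abstract does not state which formulation of the generalized entropy index was used.

```python
import numpy as np


def group_rates(y_true, y_pred):
    """Return FNR, FPR, and 0-1 loss for binary labels and binary predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == 1
    negatives = y_true == 0
    return {
        # Share of true readmissions the model failed to flag.
        "fnr": float(np.mean(y_pred[positives] == 0)),
        # Share of non-readmissions the model flagged anyway.
        "fpr": float(np.mean(y_pred[negatives] == 1)),
        # Overall misclassification rate.
        "zero_one_loss": float(np.mean(y_pred != y_true)),
    }


def generalized_entropy_index(y_true, y_pred, alpha=2):
    """Generalized entropy index over per-person 'benefits'.

    Assumption: benefit b_i = y_pred_i - y_true_i + 1, and alpha = 2.
    A value of 0 means every person receives the same benefit.
    """
    b = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float) + 1.0
    mu = b.mean()
    return float(np.mean((b / mu) ** alpha - 1.0) / (alpha * (alpha - 1)))


# Hypothetical toy data: 1 = readmitted within 30 days, 0 = not readmitted.
y_true = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 1, 1, 0])
group = np.array(["A"] * 5 + ["B"] * 5)

# Per-group error rates.
rates = {g: group_rates(y_true[group == g], y_pred[group == g]) for g in ("A", "B")}

# Parity gaps: a value of 0 would mean equal error rates across groups.
fnr_gap = rates["A"]["fnr"] - rates["B"]["fnr"]
fpr_gap = rates["A"]["fpr"] - rates["B"]["fpr"]

ge = generalized_entropy_index(y_true, y_pred)

print(rates)
print("FNR gap (A - B):", fnr_gap)
print("FPR gap (A - B):", fpr_gap)
print("Generalized entropy index:", ge)
```

In this toy data the two groups have the same 0-1 loss but opposite error profiles: one group has the higher FNR and the other the higher FPR. This is the kind of pattern that a single accuracy figure would hide and that parity gaps are meant to surface.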

Imbalanced class distribution and performance evaluation metrics: A systematic review of prediction accuracy for determining model performance in healthcare systems

Empirical analysis of performance assessment for imbalanced classification

Evaluating classifier performance with highly imbalanced Big Data

Handling imbalanced medical datasets: review of a decade of research

A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems

A Study of Data Pre-processing Techniques for Imbalanced Biomedical Data Classification

Measuring Class-Imbalance Sensitivity of Deterministic Performance Evaluation Metrics

Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance in Binary Classification

Addressing Binary Classification over Class Imbalanced Clinical Datasets Using Computationally Intelligent Techniques

Imbalanced Data Classification: A Survey and Experiments in Medical Domain

Fairness gaps in Machine learning models for hospitalization and emergency department visit risk prediction in home healthcare patients with heart failure

Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models

The harms of class imbalance corrections for machine learning based prediction models: a simulation study

Evaluating Algorithmic Bias in 30-Day Hospital Readmission Models: Retrospective Analysis

Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties

Implications of resampling data to address the class imbalance problem (IRCIP): an evaluation of impact on performance between classification algorithms in medical data

Equity in Healthcare: Analyzing Disparities in Machine Learning Predictions of Diabetic Patient Readmissions

Appropriateness of Performance Indices for Imbalanced Data Classification: An Analysis

A Survey of Methods for Managing the Classification and Solution of Data Imbalance Problem

The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models

Hybrid approaches for handling imbalanced structured and unstructured data