Evaluating Prognostic Bias of Critical Illness Severity Scores Based on Age, Sex, and Primary Language in the United States: A Retrospective Multicenter Study.

Xiaoli Liu, Max Shen, Margaret Lie, Zhongheng Zhang, Chao Liu, Deyu Li, Roger G. Mark, Zhengbo Zhang, Leo Anthony Celi
DOI: https://doi.org/10.1097/cce.0000000000001033
2024-01-01
Critical Care Explorations
Abstract:
OBJECTIVES: Although illness severity scoring systems are widely used to support clinical decision-making and assess ICU performance, their potential bias across age, sex, and primary language groups has not been well studied. We aimed to identify potential bias of Sequential Organ Failure Assessment (SOFA) and Acute Physiology and Chronic Health Evaluation (APACHE) IVa scores using large ICU databases.
DESIGN, SETTING, AND PATIENTS: This multicenter, retrospective study was conducted using data from the Medical Information Mart for Intensive Care (MIMIC) and the eICU Collaborative Research Database. SOFA and APACHE IVa scores were obtained at ICU admission. Hospital mortality was the primary outcome. Discrimination (area under the receiver operating characteristic curve [AUROC]) and calibration (standardized mortality ratio [SMR]) were assessed for all subgroups.
INTERVENTIONS: Not applicable.
MEASUREMENTS AND MAIN RESULTS: A total of 196,310 patient encounters were studied. Discrimination for both scores was worse in older patients than in younger patients and worse in female patients than in male patients. In MIMIC, discrimination of SOFA was worse in patients whose primary language was not English than in English speakers (AUROC 0.726 vs. 0.783, p < 0.0001). Calibration assessed via SMR showed statistically significant underestimation of mortality, compared with the overall cohort, in the oldest patients for both SOFA and APACHE IVa, in female patients for SOFA (SMR 1.09), and in non-English primary language patients for SOFA in MIMIC (SMR 1.38).
CONCLUSIONS: Differences in discrimination and calibration of the two scores across age, sex, and primary language groups suggest that illness severity scores are prone to bias in mortality predictions. Caution must be taken when using them for quality benchmarking and decision-making among diverse real-world populations.
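The two subgroup metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes each score has already been converted to a predicted probability of hospital death (the abstract does not say how this was done), and the function names, record layout, and toy values are all hypothetical. AUROC is computed as the fraction of death/survivor pairs in which the death received the higher predicted risk, and SMR as observed deaths divided by expected deaths, so an SMR above 1 indicates that mortality was underestimated.

```python
from collections import defaultdict


def auroc(died, risk):
    """Discrimination: probability that a randomly chosen death has a higher
    predicted risk than a randomly chosen survivor (ties count as one half)."""
    pos = [r for y, r in zip(died, risk) if y == 1]
    neg = [r for y, r in zip(died, risk) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def smr(died, risk):
    """Calibration: observed deaths / expected deaths (sum of predicted risks)."""
    return sum(died) / sum(risk)


def metrics_by_subgroup(records):
    """records: iterable of (subgroup, died, predicted_risk) tuples
    (a hypothetical layout chosen for this sketch)."""
    grouped = defaultdict(lambda: ([], []))
    for group, died, risk in records:
        grouped[group][0].append(died)
        grouped[group][1].append(risk)
    return {
        group: {"auroc": auroc(y, r), "smr": smr(y, r)}
        for group, (y, r) in grouped.items()
    }


# Made-up toy records, not data from MIMIC or eICU.
records = [
    ("A", 1, 0.8), ("A", 0, 0.2), ("A", 0, 0.4), ("A", 1, 0.6),
    ("B", 1, 0.3), ("B", 0, 0.4), ("B", 1, 0.5), ("B", 0, 0.1),
]
results = metrics_by_subgroup(records)
for group, m in sorted(results.items()):
    print(group, round(m["auroc"], 3), round(m["smr"], 3))
```

In this toy data, subgroup B has both lower discrimination and an SMR above 1, the same pattern the abstract reports for several subgroups. A real analysis would also need confidence intervals and significance tests, which this sketch omits.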