Identifying and Characterizing Bias at Scale in Clinical Notes Using Large Language Models
Donald U Apakama, Kim-Anh-Nhi Nguyen, Daphnee Hyppolite, Shelly Soffer, Aya Mudrik, Emilia Ling, Akini Moses, Ivanka Temnycky, Allison Glasser, Rebecca Anderson, Prathamesh Parchure, Evajoyce Woullard, Masoud Edalati, Lili Chan, Clair Kronk, Robert Freeman, Arash Kia, Prem Timsina, Matthew Levin, Rohan Khera, Patricia Kovatch, Alexander W. Charney, Brendan G. Carr, Lynne D. Richardson, Carol R. Horowitz, Eyal Klang, Girish Nadkarni
DOI: https://doi.org/10.1101/2024.10.24.24316073
2024-10-25
Abstract: Importance. Discriminatory language in clinical documentation impacts patient care and reinforces systemic biases. Scalable tools to detect and mitigate this are needed.
Objective. Determine utility of a frontier large language model (GPT-4) in identifying and categorizing biased language and evaluate its suggestions for debiasing.
Design. Cross-sectional study analyzing emergency department (ED) notes from the Mount Sinai Health System (MSHS) and discharge notes from MIMIC-IV.
Setting. MSHS, a large urban healthcare system, and MIMIC-IV, a public dataset.
Participants. We randomly selected 50,000 ED medical and nursing notes from 230,967 adult patients who visited an MSHS ED in 2023, and 500 discharge notes from 145,915 patients in the MIMIC-IV database. One note was selected for each unique patient.
Main Outcomes and Measures. Primary measure was accuracy of detection and categorization (discrediting, stigmatizing/labeling, judgmental, and stereotyping) of bias compared to human review. Secondary measures were proportion of patients with any bias, differences in the prevalence of bias across demographic and socioeconomic subgroups, and provider ratings of effectiveness of GPT-4's debiasing language.
Results. Bias was detected in 6.5% of MSHS and 7.4% of MIMIC-IV notes. Compared to manual review, GPT-4 had sensitivity of 95%, specificity of 86%, positive predictive value of 84% and negative predictive value of 96% for bias detection. Stigmatizing/labeling (3.4%), judgmental (3.2%), and discrediting (4.0%) biases were most prevalent. There was higher bias in Black patients (8.3%), transgender individuals (15.7% for trans-female, 16.7% for trans-male), and undomiciled individuals (27%). Patients with non-commercial insurance, particularly Medicaid, also had higher bias (8.9%). Higher bias was also seen in health-related characteristics like frequent healthcare utilization (21% for >100 visits) and substance use disorders (32.2%). Physician-authored notes showed higher bias than nursing notes (9.4% vs. 4.2%, p < 0.001). GPT-4's suggested revisions were rated highly effective by physicians, with an average improvement score of 9.6/10 in reducing bias.
Conclusions and Relevance. A frontier LLM effectively identified biased language, without further training, showing utility as a scalable fairness tool. High bias prevalence linked to certain patient characteristics underscores the need for targeted interventions. Integrating AI to facilitate unbiased documentation could significantly impact clinical practice and health outcomes.
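The detection metrics reported in the Results (sensitivity, specificity, positive predictive value, negative predictive value) all follow from a 2x2 comparison of model flags against human review. The following is a minimal sketch of that arithmetic, not the authors' evaluation code: the names `ConfusionCounts` and `detection_metrics` are hypothetical, and the counts in the usage note are invented for illustration and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing model flags to human review (hypothetical container)."""
    tp: int  # model flagged bias, reviewer agreed
    fp: int  # model flagged bias, reviewer found none
    fn: int  # model missed bias the reviewer found
    tn: int  # neither model nor reviewer found bias


def detection_metrics(c: ConfusionCounts) -> dict:
    """Return the four standard diagnostic-accuracy metrics as fractions."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # share of truly biased notes caught
        "specificity": c.tn / (c.tn + c.fp),  # share of unbiased notes left unflagged
        "ppv": c.tp / (c.tp + c.fp),          # share of flags that were correct
        "npv": c.tn / (c.tn + c.fn),          # share of non-flags that were correct
    }
```

For example, with the illustrative counts `ConfusionCounts(tp=40, fp=5, fn=10, tn=45)`, sensitivity is 40/50 and specificity is 45/50. PPV and NPV depend on how common bias is in the reviewed sample, so they would shift if the same model were evaluated on a sample with a different prevalence.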
Emergency Medicine