Assessing equitable use of large language models for clinical decision support in real-world settings: fine-tuning and internal-external validation using electronic health records from South Asia
Seyed Alireza Hasheminasab,Faisal Jamil,Muhammad Usman Afzal,Ali Haider Khan,Sehrish Ilyas,Ali Noor,Salma Abbas,Hajira Nisar Cheema,Muhammad Usman Shabbir,Iqra Hameed,Maleeha Ayub,Hamayal Masood,Amina Jafar,Amir Mukhtar Khan,Muhammad Abid Nazir,Muhammad Asaad Jamil,Faisal Sultan,Sara Khalid
DOI: https://doi.org/10.1101/2024.06.05.24308365
2024-06-05
Abstract:Objective
Fair and safe Large Language Models (LLMs) hold the potential for clinical task-shifting which, if done reliably, can benefit over-burdened healthcare systems, particularly for resource-limited settings and traditionally overlooked populations. However, this powerful technology remains largely understudied in real-world contexts, particularly in the global South. This study aims to assess if openly available LLMs can be used equitably and reliably for processing medical notes in real-world settings in South Asia.
Methods
We used publicly available medical LLMs to parse clinical notes from a large electronic health records (EHR) database in Pakistan. ChatGPT, GatorTron, BioMegatron, BioBert and ClinicalBERT were tested for bias when applied to these data, after fine-tuning them to a) publicly available clinical datasets I2B2 and N2C2 for medical concept extraction (MCE) and emrQA for medical question answering (MQA), and b) the local EHR dataset. For MCE models were applied to clinical notes with 3-label and 9-label formats and for MQA were applied to medical questions. Internal and external validation performance was measured for a) and b) using F1, precision, recall, and accuracy for MCE and BLEU and ROUGE-L for MQA.
Results
LLMs not fine-tuned to the local EHR dataset performed poorly, suggesting bias, when externally validated on it. Fine-tuning the LLMs to the local EHR data improved model performance. Specifically, the 3- label precision, recall, F1 score, and accuracy for the dataset improved by 21-31%, 11-21%, 16-27%, and 6-10% amongst GatorTron, BioMegatron, BioBert and ClinicalBERT. As an exception, ChatGPT performed better on the local EHR dataset by 10% for precision and 13% for each of recall, F1 score, and accuracy. 9-label performance trends were similar.
Conclusions
Publicly available LLMs, predominantly trained in global north settings, were found to be biased when used in a real-world clinical setting. Fine-tuning them to local data and clinical contexts can help improve their reliable and equitable use in resource-limited settings. Close collaboration between clinical and technical experts can ensure responsible and unbiased powerful tech accessible to resource-limited, overburdened settings used in ways that are safe, fair, and beneficial for all.