Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective

Yue Zhou, Barbara Di Eugenio, Lu Cheng
2024-12-01
Abstract: This paper studies the performance of large language models (LLMs), particularly regarding demographic fairness, in solving real-world healthcare tasks. We evaluate state-of-the-art LLMs with three prevalent learning frameworks across six diverse healthcare tasks and find significant challenges in applying LLMs to real-world healthcare tasks and persistent fairness issues across demographic groups. We also find that explicitly providing demographic information yields mixed results, while LLMs' ability to infer such details raises concerns about biased health predictions. Utilizing LLMs as autonomous agents with access to up-to-date guidelines does not guarantee performance improvement. We believe these findings reveal the critical limitations of LLMs in healthcare fairness and the urgent need for specialized research in this area.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper studies how well large language models (LLMs) solve practical medical tasks, with a particular focus on demographic fairness. Specifically, it addresses the following aspects:

1. **Evaluating the performance of LLMs on different medical tasks**:
   - The paper selects six medical tasks, including mortality prediction, readmission prediction, health coaching outcome prediction, and mental health diagnosis.
   - It evaluates three state-of-the-art LLMs (GPT-4, Claude-3, and LLaMA-3) on these tasks.
2. **Exploring the impact of demographic information on LLM performance**:
   - It studies how explicitly providing demographic information (such as age, gender, and race) affects model performance.
   - It analyzes whether LLMs can infer this information when it is not given explicitly, and whether such inferences lead to bias.
3. **Evaluating the performance of LLMs under different frameworks**:
   - It compares three frameworks on medical tasks: in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and LLM-as-agent.
4. **Quantifying demographic fairness**:
   - It uses two standard fairness metrics, demographic parity difference (DPD) and equal opportunity difference (EOD), to quantify performance differences across racial and gender groups.

### Main findings

1. **Poor performance of LLMs on medical tasks**:
   - Although LLMs perform well in other fields, their performance on practical medical tasks is generally poor, and many configurations perform only slightly better than random guessing.
2. **The impact of demographic information is complex**:
   - Explicitly providing demographic information does not always improve performance or fairness, and sometimes it exacerbates unfairness.
   - LLMs can infer demographic information from conversations, but such inferences may be severely biased, which affects the accuracy of health predictions.
3. **Different frameworks have different effects**:
   - For some tasks (such as mental illness diagnosis), fine-tuning performs best, while for others (such as the tasks on the MIMIC dataset), in-context learning performs better.
   - The LLM-as-agent approach performs very well on MedQA but poorly on practical medical applications.
4. **There is significant demographic unfairness**:
   - Across tasks and frameworks, LLM predictions are significantly unfair across racial and gender groups, especially for African Americans.
   - The DPD and EOD values indicate clear bias in LLMs' medical predictions.

### Conclusion

The paper reveals key limitations of LLMs in medical applications, especially in terms of demographic fairness. The results emphasize the need for specialized research on using LLMs in the medical field to ensure that the models are both fair and accurate.
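
As a rough illustration of the two fairness metrics named above, the sketch below computes DPD and EOD for binary predictions in plain Python. It assumes the commonly used definitions (DPD as the largest gap in positive-prediction rate between groups, EOD as the largest gap in true-positive rate between groups); the paper's exact formulation may differ. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def _largest_gap(rates):
    """Largest difference between any two per-group rates."""
    return max(rates.values()) - min(rates.values())


def demographic_parity_difference(y_pred, groups):
    """Assumed definition: max gap in P(prediction = 1) across groups."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return _largest_gap({g: positives[g] / totals[g] for g in totals})


def equal_opportunity_difference(y_true, y_pred, groups):
    """Assumed definition: max gap in true-positive rate across groups.

    Groups with no positive labels are skipped because their
    true-positive rate is undefined.
    """
    true_positives = defaultdict(int)
    actual_positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            actual_positives[group] += 1
            true_positives[group] += int(pred == 1)
    return _largest_gap(
        {g: true_positives[g] / actual_positives[g] for g in actual_positives}
    )


# Toy data (made up for illustration): two groups of four patients each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

dpd = demographic_parity_difference(y_pred, groups)
eod = equal_opportunity_difference(y_true, y_pred, groups)
print(dpd)  # 0.5: group A is predicted positive 3/4 of the time, group B 1/4
print(eod)  # 0.5: true-positive rate is 2/2 for group A, 1/2 for group B
```

Under these definitions a value of 0 means the groups are treated identically on that metric, and larger values indicate a larger disparity.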