Abstract: This paper studies the performance of large language models (LLMs), particularly regarding demographic fairness, in solving real-world healthcare tasks. We evaluate state-of-the-art LLMs with three prevalent learning frameworks across six diverse healthcare tasks and find significant challenges in applying LLMs to real-world healthcare tasks and persistent fairness issues across demographic groups. We also find that explicitly providing demographic information yields mixed results, while LLMs' ability to infer such details raises concerns about biased health predictions. Utilizing LLMs as autonomous agents with access to up-to-date guidelines does not guarantee performance improvement. We believe these findings reveal the critical limitations of LLMs in healthcare fairness and the urgent need for specialized research in this area.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper studies how large language models (LLMs) perform on practical medical tasks, with a particular focus on demographic fairness. Specifically, it addresses the following aspects:

1. **Evaluating the performance of LLMs on different medical tasks**:
   - The paper selects six medical tasks, including mortality prediction, readmission prediction, health coaching outcome prediction, and mental health diagnosis.
   - It evaluates three state-of-the-art LLMs (GPT-4, Claude-3, and LLaMA-3) on these tasks.
2. **Exploring the impact of demographic information on the performance of LLMs**:
   - It studies how explicitly providing demographic information (such as age, gender, and race) affects model performance.
   - It analyzes whether LLMs can infer this information when it is not given explicitly, and whether such inferences lead to bias.
3. **Evaluating the performance of LLMs under different frameworks**:
   - It compares three frameworks on medical tasks: in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and LLM-as-agent.
4. **Quantifying demographic fairness**:
   - It uses two standard fairness metrics, demographic parity difference (DPD) and equal opportunity difference (EOD), to quantify performance differences between racial and gender groups.

### Main findings

1. **Poor performance of LLMs on medical tasks**:
   - Although LLMs perform well in other fields, their performance on practical medical tasks is generally poor, and many configurations score only slightly above the random-guessing baseline.
2. **The impact of demographic information is complex**:
   - Explicitly providing demographic information does not always improve performance or fairness, and sometimes it exacerbates unfairness.
   - LLMs can infer demographic information from conversations, but such inferences may be severely biased, which affects the accuracy of health predictions.
3. **Different frameworks have different effects**:
   - For some tasks (such as mental illness diagnosis), the fine-tuning framework performs best, while for others (such as tasks on the MIMIC dataset), in-context learning performs better.
   - The LLM-as-agent method performs very well on the MedQA task but poorly in practical medical applications.
4. **There is significant demographic unfairness**:
   - Across tasks and frameworks, LLM predictions are significantly unfair across racial and gender groups, especially for African-Americans.
   - The DPD and EOD values indicate that LLMs show clear biases in medical predictions.

### Conclusion

The paper reveals key limitations of LLMs in medical applications, especially in terms of demographic fairness. The results emphasize the need for specialized research on using LLMs in the medical field to ensure that the models are fair and accurate.
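The two fairness metrics named above, DPD and EOD, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses the common definitions (DPD as the largest gap in positive-prediction rate between groups, EOD as the largest gap in true-positive rate between groups), which may differ in detail from the paper's. The function names and the toy labels are hypothetical.

```python
from collections import defaultdict


def _rate_gap(flags_by_group):
    """Largest minus smallest mean of 0/1 flags across non-empty groups."""
    rates = [sum(flags) / len(flags) for flags in flags_by_group.values() if flags]
    return max(rates) - min(rates)


def demographic_parity_difference(y_pred, groups):
    """Gap in positive-prediction rate P(y_pred = 1 | group) across groups."""
    by_group = defaultdict(list)
    for pred, group in zip(y_pred, groups):
        by_group[group].append(pred)
    return _rate_gap(by_group)


def equal_opportunity_difference(y_true, y_pred, groups):
    """Gap in true-positive rate P(y_pred = 1 | y_true = 1, group) across groups."""
    by_group = defaultdict(list)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:  # only actual positives count toward the true-positive rate
            by_group[group].append(pred)
    return _rate_gap(by_group)


# Toy, made-up labels for illustration only (not data from the paper).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

dpd = demographic_parity_difference(y_pred, groups)
eod = equal_opportunity_difference(y_true, y_pred, groups)
print(f"DPD = {dpd:.2f}, EOD = {eod:.2f}")
```

Under these definitions, a value of 0 means all groups are treated identically on that metric, and larger values mean a larger disparity. The sketch raises an error if no group contains any usable examples (for EOD, any actual positives), which a real evaluation would need to handle.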

Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective
