Evaluating and Addressing Demographic Disparities in Medical Large Language Models: A Systematic Review

Mahmud Omar Sr., Vera Sorin Sr., Reem Agbareia, Donald U Apakama, Ali Soroush, Ankit Sakhuja, Robert Freeman, Carol R Horowitz, Lynne D Richardson, Girish Nadkarni, Eyal Klang
DOI: https://doi.org/10.1101/2024.09.09.24313295
2024-10-01
Abstract: Background: Large language models (LLMs) are increasingly evaluated for use in healthcare. However, concerns about their impact on disparities persist. This study reviews current research on demographic biases in LLMs to identify prevalent bias types, assess measurement methods, and evaluate mitigation strategies. Methods: We conducted a systematic review, searching publications from January 2018 to July 2024 across five databases. We included peer-reviewed studies evaluating demographic biases in LLMs, focusing on gender, race, ethnicity, age, and other factors. Study quality was assessed using the Joanna Briggs Institute Critical Appraisal Tools. Results: Our review included 24 studies. Of these, 22 (91.7%) identified biases in LLMs. Gender bias was the most prevalent, reported in 15 of 16 studies (93.7%). Racial or ethnic biases were observed in 10 of 11 studies (90.9%). Only two studies found minimal or no bias in certain contexts. Mitigation strategies mainly included prompt engineering, with varying effectiveness. However, these findings are tempered by a potential publication bias, as studies with negative results are less frequently published. Conclusion: Biases are observed in LLMs across various medical domains. While bias detection is improving, effective mitigation strategies are still developing. As LLMs increasingly influence critical decisions, addressing these biases and their resultant disparities is essential for ensuring fair AI systems. Future research should focus on a wider range of demographic factors, intersectional analyses, and non-Western cultural contexts.
Health Systems and Quality Improvement
What problem does this paper attempt to address?
This paper sets out to evaluate and address demographic disparities in large language models (LLMs) used in medicine. Specifically:

1. **Identifying prevalent types of bias**: Through a systematic review of existing research, the paper identifies the demographic biases that are common in medical LLMs. These include, but are not limited to, biases related to gender, race, ethnicity, and age.
2. **Evaluating measurement methods**: The paper examines the effectiveness and limitations of the methods currently used to detect these biases. The included studies quantify bias in a variety of ways, such as prompt testing, corpus analysis, task-specific evaluation, and sentiment analysis.
3. **Evaluating mitigation strategies**: The paper also assesses existing mitigation strategies, such as prompt engineering and de-biasing algorithms. Although some studies suggest that these methods can reduce certain types of bias, their effectiveness varies with the type of bias and the application scenario.

### Specific problem description

- **Background and motivation**: As LLMs are used more widely in healthcare, their impact on demographic disparities has drawn growing attention. These models may produce unfair outputs because of biases inherent in their training data, which in turn can affect clinical decision-making and other key medical tasks. Ensuring the fairness and accuracy of these models is therefore crucial.
- **Research objectives**:
  - Identify the prevalent types of bias in medical LLMs.
  - Evaluate the methods used to detect these biases.
  - Evaluate existing mitigation strategies and suggest improvements.

### Method overview

To achieve these objectives, the authors conducted a systematic review of literature published from January 2018 to July 2024.
The search covered five databases (PubMed, Embase, Web of Science, APA PsycInfo, and Scopus) and applied strict screening criteria. In total, 24 studies met the inclusion criteria, covering multiple LLMs and their applications in different medical tasks.

### Main findings

- **Biases are prevalent**: Of the 24 studies, 22 (91.7%) found bias to varying degrees. Gender bias was the most common, reported in 15 of 16 studies (93.7%), followed by racial or ethnic bias, reported in 10 of 11 studies (90.9%).
- **Mitigation strategies have limited effectiveness**: Although some studies attempted to mitigate bias through prompt engineering and de-biasing algorithms, the effectiveness of these methods is inconsistent, and standardized evaluation metrics are lacking.

### Conclusions and future directions

The paper emphasizes that although techniques for detecting bias are improving, effective mitigation strategies are still under development. To ensure fair AI systems, future research should cover a wider range of demographic factors, intersectional analyses, and non-Western cultural contexts. In addition, more standardized methods for evaluating and mitigating bias need to be developed and validated.

### Summary

The core aim of this paper is to evaluate and address demographic biases in medical LLMs so that these models are fair and accurate in practical use. Through a systematic review of existing research, the authors show how prevalent and varied these biases are and point out the shortcomings of current mitigation strategies, providing directions for future research.
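
The proportions quoted in the main findings are simple ratios, so they can be re-derived as a quick sanity check. The Python sketch below is illustrative only: the `percent` helper and the `reported` dictionary are hypothetical names introduced here and are not part of the paper; the counts are the ones stated in the abstract.

```python
# Counts (studies reporting bias, studies examining it) as stated in the abstract.
# The dictionary and helper names are assumptions made for this illustration.
reported = {
    "any bias": (22, 24),
    "gender bias": (15, 16),
    "racial or ethnic bias": (10, 11),
}


def percent(hits: int, total: int) -> float:
    """Return hits/total as an unrounded percentage.

    Rejects impossible pairs (for example 16 of 15), which catches
    transposed numerators and denominators.
    """
    if total <= 0 or not 0 <= hits <= total:
        raise ValueError("need 0 <= hits <= total and total > 0")
    return 100 * hits / total


if __name__ == "__main__":
    for label, (hits, total) in reported.items():
        print(f"{label}: {hits}/{total} = {percent(hits, total):.2f}%")
```

Run as written, this prints 91.67%, 93.75%, and 90.91%. The 93.7% reported for gender bias is therefore 93.75% truncated rather than rounded to one decimal place.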