Abstract: Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety, making the integrity of AI in healthcare more critical than ever. As Large Language Models (LLMs) take on a growing role in medical decision-making, addressing their biases and enhancing their accuracy is key to delivering safe, reliable care. This study addresses these challenges head-on by introducing new resources designed to promote ethical and precise AI in healthcare. We present two datasets: BiasMD, featuring 6,007 question-answer pairs crafted to evaluate and mitigate biases in health-related LLM outputs, and DiseaseMatcher, with 32,000 clinical question-answer pairs spanning 700 diseases, aimed at assessing symptom-based diagnostic accuracy. Using these datasets, we developed EthiClinician, a fine-tuned model built on the ChatDoctor framework, which outperforms GPT-4 in both ethical reasoning and clinical judgment. By exposing and correcting hidden biases in existing models for healthcare, our work sets a new benchmark for safer, more reliable patient outcomes.

What problem does this paper attempt to address?

This paper addresses two main problems in applying large language models (LLMs) to healthcare: **bias** and **diagnostic accuracy**. Specifically:

1. **Bias problem**:
   - **Background**: Biased AI-generated medical advice and misdiagnoses can endanger patient safety, so fairness and reliability are essential for AI models used in healthcare.
   - **Challenge**: Existing LLMs may produce biased output when giving medical advice. These biases can reinforce negative social stereotypes about certain diseases or populations, which in turn affects healthcare awareness and treatment progress.
   - **Solution**: The paper introduces the BiasMD dataset, which contains 6,007 question-answer pairs for evaluating and mitigating biases in health-related LLM outputs.
2. **Diagnostic accuracy problem**:
   - **Background**: Although LLMs show great potential in medicine, they are still limited in their ability to provide clear, well-founded medical knowledge.
   - **Challenge**: Even models fine-tuned on medical data perform unsatisfactorily on medical knowledge. For example, ChatGPT has an accuracy rate of only 52% on the United States Medical Licensing Examination (USMLE).
   - **Solution**: The paper introduces the DiseaseMatcher dataset, which contains 32,000 clinical question-answer pairs covering 700 diseases and their symptoms, for evaluating symptom-based diagnostic accuracy.
3. **Integrated solution**:
   - **EthiClinician model**: A model fine-tuned on the ChatDoctor framework. EthiClinician outperforms GPT-4 on both the BiasMD and DiseaseMatcher datasets, that is, in both ethical reasoning and clinical judgment.
   - **Contribution**: Together, the datasets and the model provide new benchmarks for evaluating and improving the ethics and accuracy of LLMs in healthcare.

### Main contributions

1. **BiasMD dataset**: Contains 6,007 question-answer pairs for evaluating and fine-tuning the ethical responses of LLMs across different demographic groups.
2. **DiseaseMatcher dataset**: Contains 32,000 clinical question-answer pairs covering 700 diseases and their symptoms, for evaluating and enhancing the medical reasoning ability of LLMs.
3. **EthiClinician model**: A model fine-tuned on the ChatDoctor framework, which significantly improves both ethical behavior and diagnostic accuracy.

### Results

- **BiasMD performance**: EthiClinician's responses on the BiasMD dataset are almost entirely unbiased, while GPT-4 reaches an accuracy of 90.1%.
- **DiseaseMatcher performance**: EthiClinician reaches an accuracy of 92.47% on the DiseaseMatcher dataset, significantly higher than the other models: GPT-4 (82.84%), ChatDoctor (51.44%), and Llama2-7B (20.4%).

### Discussion

- **Importance of ethics and accuracy**: In healthcare, it is crucial to ensure that LLMs are both fair and accurate, because demographic differences may exacerbate existing biases.
- **Future directions**: The datasets need to be expanded to represent a wider range of human identities, so that more inclusive AI systems can be built.

In conclusion, this paper provides important tools and methods for improving the ethics and diagnostic accuracy of LLMs in healthcare by introducing new datasets and a new model.
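The accuracy figures in the Results are fractions of question-answer pairs that a model answers correctly. The following is a minimal sketch of that kind of scoring, not the paper's evaluation code: the `QAPair` type, the `normalize` and `accuracy` helpers, the toy items, and the exact-match rule after whitespace and case normalization are all assumptions made for illustration. The summary above does not describe the actual dataset format or matching procedure.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class QAPair:
    """One benchmark item (hypothetical layout, not the real dataset schema)."""
    question: str
    expected: str  # gold answer label


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.strip().lower().split())


def accuracy(pairs: Sequence[QAPair], answer_fn: Callable[[str], str]) -> float:
    """Fraction of pairs where the model's normalized answer equals the gold answer."""
    if not pairs:
        raise ValueError("no question-answer pairs to score")
    correct = sum(
        normalize(answer_fn(pair.question)) == normalize(pair.expected)
        for pair in pairs
    )
    return correct / len(pairs)


# Invented placeholder items; they are NOT taken from BiasMD or DiseaseMatcher.
toy_pairs = [
    QAPair("Q1", "Patient A"),
    QAPair("Q2", "Patient B"),
    QAPair("Q3", "Neither"),
    QAPair("Q4", "Patient A"),
]

# Stand-in for a model call: canned answers, one of which (Q3) is wrong.
canned_answers = {
    "Q1": "patient a",
    "Q2": "Patient B ",
    "Q3": "Patient A",
    "Q4": "Patient  A",
}


def toy_model(question: str) -> str:
    return canned_answers[question]


toy_accuracy = accuracy(toy_pairs, toy_model)
print(f"toy accuracy: {toy_accuracy:.2%}")
```

In practice `answer_fn` would wrap a call to the model under test. Exact matching is the simplest possible choice; free-text answers would likely need a more forgiving comparison.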

Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare
