Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare

Pardis Sadat Zahraei, Zahra Shakeri
2024-10-09
Abstract: Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety, making the integrity of AI in healthcare more critical than ever. As Large Language Models (LLMs) take on a growing role in medical decision-making, addressing their biases and enhancing their accuracy is key to delivering safe, reliable care. This study addresses these challenges head-on by introducing new resources designed to promote ethical and precise AI in healthcare. We present two datasets: BiasMD, featuring 6,007 question-answer pairs crafted to evaluate and mitigate biases in health-related LLM outputs, and DiseaseMatcher, with 32,000 clinical question-answer pairs spanning 700 diseases, aimed at assessing symptom-based diagnostic accuracy. Using these datasets, we developed the EthiClinician, a fine-tuned model built on the ChatDoctor framework, which outperforms GPT-4 in both ethical reasoning and clinical judgment. By exposing and correcting hidden biases in existing models for healthcare, our work sets a new benchmark for safer, more reliable patient outcomes.
Computation and Language
What problem does this paper attempt to address?
This paper addresses two main problems in applying large language models (LLMs) to healthcare: **bias** and **diagnostic accuracy**. Specifically:

1. **Bias problem**:
   - **Background**: AI-generated medical advice and misdiagnoses may endanger patient safety, so it is crucial that models used in healthcare are fair and reliable.
   - **Challenge**: Existing LLMs may produce biased output when generating medical advice, and these biases may reinforce negative societal stereotypes about certain diseases or populations, which in turn can affect healthcare awareness and treatment progress.
   - **Solution**: The paper introduces the BiasMD dataset, which contains 6,007 question-answer pairs for evaluating and mitigating biases in health-related LLM outputs.
2. **Diagnostic accuracy problem**:
   - **Background**: Although LLMs show great potential in the medical field, they still have limitations in providing clear and well-founded medical knowledge.
   - **Challenge**: Even models fine-tuned on medical data do not perform satisfactorily on medical knowledge. For example, ChatGPT achieves only 52% accuracy on the United States Medical Licensing Examination (USMLE).
   - **Solution**: The paper introduces the DiseaseMatcher dataset, which contains 32,000 clinical question-answer pairs covering 700 diseases and their symptoms, for evaluating symptom-based diagnostic accuracy.
3. **Integrated solution**:
   - **EthiClinician model**: A model fine-tuned on the ChatDoctor framework. EthiClinician outperforms GPT-4 on both the BiasMD and DiseaseMatcher datasets, especially in ethical reasoning and clinical judgment.
   - **Contribution**: Through these datasets and the model, the paper provides new benchmarks for evaluating and improving the ethics and accuracy of LLMs in healthcare.

### Main contributions

1. **BiasMD dataset**: Contains 6,007 question-answer pairs for evaluating and fine-tuning the ethical responses of LLMs across different demographic groups.
2. **DiseaseMatcher dataset**: Contains 32,000 clinical question-answer pairs covering 700 diseases and their symptoms, for evaluating and enhancing the medical reasoning ability of LLMs.
3. **EthiClinician model**: A model fine-tuned on the ChatDoctor framework, which significantly improves both ethical behavior and diagnostic accuracy.

### Results

- **BiasMD performance**: EthiClinician gives almost completely unbiased responses on the BiasMD dataset, whereas GPT-4 reaches an accuracy of 90.1%.
- **DiseaseMatcher performance**: EthiClinician reaches an accuracy of 92.47% on the DiseaseMatcher dataset, significantly higher than the other models: GPT-4 (82.84%), ChatDoctor (51.44%), and Llama2-7B (20.4%).

### Discussion

- **Importance of ethics and accuracy**: In healthcare, it is crucial to ensure the fairness and accuracy of LLMs, because demographic differences may exacerbate existing biases.
- **Future directions**: The datasets need to be expanded further to represent a wider range of human identities, so that more inclusive AI systems can be built.

In conclusion, by introducing new datasets and a fine-tuned model, this paper provides important tools and methods for improving the ethics and diagnostic accuracy of LLMs in healthcare.
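
The accuracy figures reported above are, in essence, the fraction of question-answer pairs for which a model's answer matches the reference answer. The summary does not describe the paper's actual scoring procedure, so the following is only a minimal sketch under stated assumptions: exact matching after light normalization, with `normalize`, `accuracy`, `toy_pairs`, and `toy_model` all being hypothetical names and placeholder data that do not come from BiasMD or DiseaseMatcher.

```python
from typing import Callable, Iterable, Tuple


def normalize(answer: str) -> str:
    """Strip surrounding whitespace and periods, then lowercase.

    Assumption: answers are short labels, so trivial formatting
    differences should not count as errors.
    """
    return answer.strip().strip(".").lower()


def accuracy(
    pairs: Iterable[Tuple[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Return the fraction of (question, reference) pairs answered correctly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no question-answer pairs to score")
    correct = sum(
        normalize(predict(question)) == normalize(reference)
        for question, reference in pairs
    )
    return correct / len(pairs)


# Placeholder data for illustration only; not taken from either dataset.
toy_pairs = [("Q1", "Yes"), ("Q2", "No"), ("Q3", "No"), ("Q4", "Yes")]


def toy_model(question: str) -> str:
    """Hypothetical stand-in for a call to an LLM."""
    canned = {"Q1": "yes", "Q2": "No.", "Q3": "Yes", "Q4": " yes "}
    return canned[question]


print(f"accuracy = {accuracy(toy_pairs, toy_model):.2%}")  # accuracy = 75.00%
```

Exact matching is the simplest possible choice and fits only short, closed-form answers. Free-text model outputs would probably need answer extraction or human review before they can be scored this way.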