Abstract: Clinical Decision Support Systems (CDSS) utilize evidence-based knowledge and patient data to offer real-time recommendations, with Large Language Models (LLMs) emerging as a promising tool to generate plain-text explanations for medical decisions. This study explores the effectiveness and reliability of LLMs in generating explanations for diagnoses based on patient complaints. Three experienced doctors evaluated LLM-generated explanations of the connection between patient complaints and doctor- and model-assigned diagnoses across several stages. Experimental results demonstrated that LLM explanations significantly increased doctors' agreement rates with given diagnoses and highlighted potential errors in LLM outputs, ranging from 5% to 30%. The study underscores the potential and challenges of LLMs in healthcare and emphasizes the need for careful integration and evaluation to ensure patient safety and optimal clinical utility.

### What Problem Does This Paper Attempt to Solve?

This paper explores the effectiveness and reliability of large language models (LLMs) in generating explanations for medical diagnoses. Specifically, the researchers address the following points:

1. **Improving Doctors' Agreement with Diagnoses**: By generating easily understandable text explanations, LLMs can help doctors see the connection between a patient's complaints and the diagnosis, thereby raising doctors' agreement with a given diagnosis.
2. **Evaluating the Quality of LLM-Generated Explanations**: The researchers designed a series of experiments in which three experienced doctors evaluated LLM-generated explanations to determine whether they are accurate, clear, and error-free.
3. **Exploring the Potential Application of LLMs in Clinical Decision Support Systems (CDSS)**: The researchers aim to understand how LLMs can be applied in practice in the medical field, particularly whether they can effectively assist doctors in making more accurate diagnoses.
4. **Identifying Potential Errors in LLM-Generated Explanations**: The researchers also examine the types of errors that occur in LLM-generated explanations, including fabricated symptoms and unclear arguments, and analyze the impact of these errors on doctors' decisions.

### Research Background

Clinical Decision Support Systems (CDSS) utilize evidence-based knowledge and patient data to provide real-time recommendations to medical professionals. With technological advancements, large language models (LLMs) have emerged as a promising tool due to their ability to generate natural language explanations. However, whether LLM-generated explanations are reliable and accurate, and how they influence doctors' decisions, remain open questions that require in-depth research.

### Experimental Design

The researchers used patient complaints and diagnostic data from the RuMedBench dataset and generated explanations by calling the GPT-3.5-turbo model via API. The experiment was divided into three stages:

1. **Stage 1**: Evaluate the quality of LLM-generated explanations. Doctors judge whether the provided diagnosis is reasonable, whether the explanation is correct, and whether the explanation contains any errors.
2. **Stage 2**: Evaluate the impact of explanations on doctors' decisions. Doctors judge whether the diagnosis is reasonable without seeing the explanation.
3. **Stage 3**: Evaluate the quality of new diagnoses generated by LLMs and of their explanations. Doctors conduct another evaluation.

### Experimental Results

1. **Impact of Explanations on Doctors' Decisions**: Providing explanations significantly increased doctors' agreement with the given diagnoses, but some errors in the explanations were also found.
2. **Agreement Rate Between Doctors and Model Diagnoses**: Doctors agreed more often with the model-generated diagnoses than with the original doctor-recorded diagnoses. This may be because the model diagnoses were based solely on patient complaints, while the original diagnoses might have drawn on additional information.
3. **Quality Evaluation of Explanations**: After filtering, most explanations were considered reasonable, but 5% to 30% of the explanations still contained errors, mainly fabricated symptoms and unclear arguments.

### Discussion

1. **Difficulty in Evaluating Explanation Quality**: Different doctors have different requirements and standards for explanations, which makes explanation quality harder to evaluate.
2. **Impact of Explanations on Doctors' Decisions**: Explanations can significantly increase doctors' agreement with a diagnosis but may also introduce new errors.
3. **Prospects of LLMs in CDSS**: Although LLMs show potential in generating medical explanations, further research and improvements are needed to ensure their safety and effectiveness in clinical applications.

Overall, this paper demonstrates through experiments the potential of LLMs in generating explanations for medical diagnoses, but it also points out challenges that must be overcome in practical applications.
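The two headline quantities in the results, the doctors' agreement rate with and without an explanation and the share of explanations flagged as erroneous, can be illustrated with a small calculation. The following is a minimal sketch and not the paper's evaluation code: the `Rating` record, its field names, and the toy ratings are all hypothetical and chosen only to show how such rates could be derived from per-case judgments.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """One doctor's judgment of one case (hypothetical record layout)."""
    doctor: str
    case_id: int
    with_explanation: bool            # True if the doctor saw the LLM explanation
    agrees: bool                      # doctor judged the diagnosis reasonable
    explanation_has_error: bool = False  # only meaningful when with_explanation


def agreement_rate(ratings, with_explanation):
    """Share of ratings in one condition where the doctor agreed with the diagnosis."""
    subset = [r for r in ratings if r.with_explanation == with_explanation]
    if not subset:
        return float("nan")
    return sum(r.agrees for r in subset) / len(subset)


def error_rate_by_doctor(ratings):
    """Per doctor, the share of shown explanations that the doctor flagged as erroneous."""
    shown = defaultdict(int)
    flagged = defaultdict(int)
    for r in ratings:
        if r.with_explanation:
            shown[r.doctor] += 1
            flagged[r.doctor] += r.explanation_has_error
    return {doctor: flagged[doctor] / shown[doctor] for doctor in shown}


# Toy data, invented for illustration only (not taken from the study).
toy_ratings = [
    Rating("A", 1, True, True),
    Rating("A", 2, True, True, explanation_has_error=True),
    Rating("A", 1, False, True),
    Rating("A", 2, False, False),
    Rating("B", 1, True, True),
    Rating("B", 2, True, False),
    Rating("B", 1, False, False),
    Rating("B", 2, False, False),
]

rate_with = agreement_rate(toy_ratings, with_explanation=True)
rate_without = agreement_rate(toy_ratings, with_explanation=False)
errors = error_rate_by_doctor(toy_ratings)
```

Reporting the error rate per doctor rather than pooled reflects the point raised in the discussion that doctors apply different standards to explanations, which is one plausible reason a range such as 5% to 30% is reported instead of a single figure.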

Deciphering Diagnoses: How Large Language Models Explanations Influence Clinical Decision Making