Deciphering Diagnoses: How Large Language Models Explanations Influence Clinical Decision Making

D. Umerenkov, G. Zubkova, A. Nesterov
2023-10-03
Abstract: Clinical Decision Support Systems (CDSS) utilize evidence-based knowledge and patient data to offer real-time recommendations, with Large Language Models (LLMs) emerging as a promising tool to generate plain-text explanations for medical decisions. This study explores the effectiveness and reliability of LLMs in generating explanations for diagnoses based on patient complaints. Three experienced doctors evaluated LLM-generated explanations of the connection between patient complaints and doctor- and model-assigned diagnoses across several stages. Experimental results demonstrated that LLM explanations significantly increased doctors' agreement rates with given diagnoses and highlighted potential errors in LLM outputs, ranging from 5% to 30%. The study underscores the potential and challenges of LLMs in healthcare and emphasizes the need for careful integration and evaluation to ensure patient safety and optimal clinical utility.
Computation and Language
What problem does this paper attempt to address?
### What Problem Does This Paper Attempt to Solve?

This paper explores the effectiveness and reliability of large language models (LLMs) in generating explanations for medical diagnoses. Specifically, the researchers aim to address the following points:

1. **Increasing Doctors' Agreement with Diagnoses**: By generating easy-to-understand text explanations, LLMs can help doctors see the connection between a patient's complaints and the diagnosis, thereby increasing doctors' agreement with the assigned diagnoses.
2. **Evaluating the Quality of LLM-Generated Explanations**: The researchers designed a series of experiments in which three experienced doctors evaluated LLM-generated explanations to determine whether they are accurate, clear, and error-free.
3. **Exploring Applications of LLMs in Clinical Decision Support Systems (CDSS)**: The researchers aim to understand how LLMs can be applied in medical practice, particularly whether they can help doctors make more accurate diagnoses.
4. **Identifying Errors in LLM-Generated Explanations**: The researchers also examine the types of errors that LLM-generated explanations may contain, such as fabricated symptoms and unclear arguments, and analyze how these errors affect doctors' decisions.

### Research Background

Clinical Decision Support Systems (CDSS) use evidence-based knowledge and patient data to provide real-time recommendations to medical professionals. Large language models (LLMs) have emerged as a promising tool for such systems because they can generate natural-language explanations. However, whether LLM-generated explanations are reliable and accurate, and how they influence doctors' decisions, remain open questions that require in-depth research.
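Because the evaluation relies on three doctors judging the same cases, and their standards can differ, it helps to quantify how consistently they make the same yes/no judgement. One standard statistic for several raters is Fleiss' kappa. The summary does not say which statistic, if any, the authors used, so the following is a minimal sketch of the general technique, not the study's method. It assumes each judgement is coded as an integer category (for example 1 = "diagnosis is reasonable", 0 = "not reasonable") and that every case was rated by all three doctors. The `toy` ratings are made up for illustration.

```python
from typing import List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]], n_categories: int) -> float:
    """Fleiss' kappa for items that were each rated by the same number of raters.

    ratings[i][r] is the category (0 .. n_categories - 1) that rater r gave item i.
    Returns NaN when kappa is undefined (all ratings fall in a single category).
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")

    # counts[i][k] = number of raters who put item i into category k
    counts: List[List[int]] = []
    for row in ratings:
        if len(row) != n_raters:
            raise ValueError("every item must have the same number of raters")
        item_counts = [0] * n_categories
        for label in row:
            if not 0 <= label < n_categories:
                raise ValueError(f"label {label} is outside 0..{n_categories - 1}")
            item_counts[label] += 1
        counts.append(item_counts)

    # Observed agreement: share of agreeing rater pairs per item, averaged over items
    pairs = n_raters * (n_raters - 1)
    p_bar = sum((sum(c * c for c in row) - n_raters) / pairs for row in counts) / n_items

    # Chance agreement from the overall proportion of ratings in each category
    total = n_items * n_raters
    p_e = sum((sum(row[k] for row in counts) / total) ** 2 for k in range(n_categories))

    if p_e == 1.0:
        return float("nan")  # every rating in one category: kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example (made-up ratings, not study data): 4 cases x 3 doctors,
# 1 = "diagnosis is reasonable", 0 = "not reasonable".
toy = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 1]]
print(round(fleiss_kappa(toy, n_categories=2), 3))  # → 0.314
```

Kappa corrects raw agreement for the agreement expected by chance, so two doctors who both accept almost every diagnosis do not look more consistent than they really are.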
### Experimental Design

The researchers used patient complaints and diagnoses from the RuMedBench dataset and generated explanations by calling the GPT-3.5-turbo model through its API. The experiment was divided into three stages:

1. **Stage 1**: Evaluate the quality of LLM-generated explanations. Doctors judge whether the given diagnosis is reasonable, whether the explanation is correct, and whether the explanation contains any errors.
2. **Stage 2**: Evaluate the impact of explanations on doctors' decisions. Doctors judge whether the diagnosis is reasonable without seeing the explanation.
3. **Stage 3**: Evaluate the quality of new diagnoses generated by the LLM, together with their explanations. Doctors repeat the evaluation.

### Experimental Results

1. **Impact of Explanations on Doctors' Decisions**: Providing explanations significantly increased doctors' agreement with the given diagnoses, but the doctors also found errors in some of the explanations.
2. **Agreement Between Doctors and Model Diagnoses**: Doctors agreed more often with the model-generated diagnoses than with the diagnoses originally recorded by doctors. A likely reason is that the model's diagnoses were based solely on patient complaints, the same information the evaluating doctors worked from, whereas the original diagnoses may have drawn on additional information.
3. **Quality of Explanations**: After filtering, most explanations were judged reasonable, but 5% to 30% still contained errors, mainly fabricated symptoms and unclear arguments.

### Discussion

1. **Difficulty of Evaluating Explanation Quality**: Doctors have different requirements and standards for explanations, which makes explanation quality harder to evaluate.
2. **Impact of Explanations on Doctors' Decisions**: Explanations can significantly increase doctors' agreement with diagnoses, but erroneous explanations may also introduce new errors into their decisions.
3. **Prospects of LLMs in CDSS**: Although LLMs show potential for generating medical explanations, further research and improvement are needed to ensure they are safe and effective in clinical use.

Overall, the paper demonstrates experimentally the potential of LLMs to generate medical diagnostic explanations, while also pointing out challenges that must be overcome before they can be used in practice.
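The comparison between Stage 1 (explanation shown) and Stage 2 (no explanation) comes down to two proportions per doctor: how often the doctor accepts the given diagnosis with and without an explanation, and how often a shown explanation is flagged as containing an error (the quantity behind the 5% to 30% range reported in the results). The sketch below computes both from a flat list of judgements. The record layout (`Judgement` and its fields), the helper names, and the toy data are hypothetical illustrations, not the authors' data format or results.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Judgement:
    """One doctor's verdict on one case (hypothetical record layout)."""
    doctor: str                   # hypothetical rater identifier
    with_explanation: bool        # was an LLM explanation shown for this case?
    agrees: bool                  # did the doctor accept the given diagnosis?
    flagged_error: bool = False   # did the doctor mark an error in the explanation?


def _rate(hits: int, total: int) -> float:
    """Proportion hits/total, or NaN when there is nothing to count."""
    return hits / total if total else float("nan")


def agreement_rates(judgements: Iterable[Judgement]) -> Dict[str, Tuple[float, float]]:
    """Per doctor: (agreement rate without explanation, agreement rate with explanation)."""
    # counts[doctor][with_explanation] = [number agreed, number judged]
    counts: Dict[str, Dict[bool, List[int]]] = defaultdict(
        lambda: {False: [0, 0], True: [0, 0]}
    )
    for j in judgements:
        cell = counts[j.doctor][j.with_explanation]
        cell[0] += int(j.agrees)
        cell[1] += 1
    return {doc: (_rate(*c[False]), _rate(*c[True])) for doc, c in counts.items()}


def error_rates(judgements: Iterable[Judgement]) -> Dict[str, float]:
    """Per doctor: share of shown explanations that the doctor flagged as containing an error."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for j in judgements:
        if j.with_explanation:
            counts[j.doctor][0] += int(j.flagged_error)
            counts[j.doctor][1] += 1
    return {doc: _rate(*c) for doc, c in counts.items()}


# Toy judgements (made up, not results from the study).
judgements = [
    Judgement("A", with_explanation=False, agrees=True),
    Judgement("A", with_explanation=False, agrees=False),
    Judgement("A", with_explanation=True, agrees=True),
    Judgement("A", with_explanation=True, agrees=True, flagged_error=True),
    Judgement("B", with_explanation=False, agrees=False),
    Judgement("B", with_explanation=True, agrees=True),
]
print(agreement_rates(judgements))  # → {'A': (0.5, 1.0), 'B': (0.0, 1.0)}
print(error_rates(judgements))      # → {'A': 0.5, 'B': 0.0}
```

Keeping rates per doctor, rather than pooling all judgements, preserves the differences in standards between doctors that the discussion highlights; a doctor with no judgements in one condition gets NaN for that rate instead of a misleading zero.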