Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine

Thomas Savage, Ashwin Nayak, Robert Gallo, Ekanath Rangan, Jonathan H. Chen
DOI: https://doi.org/10.1038/s41746-024-01010-1
IF: 15.2
2024-01-25
npj Digital Medicine
Abstract: One of the major barriers to using large language models (LLMs) in medicine is the perception that they use uninterpretable methods to make clinical decisions that are inherently different from the cognitive processes of clinicians. In this manuscript we develop diagnostic reasoning prompts to study whether LLMs can imitate clinical reasoning while accurately forming a diagnosis. We find that GPT-4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy. This is significant because an LLM that can imitate clinical reasoning to provide an interpretable rationale offers physicians a means to evaluate whether an LLM's response is likely correct and can be trusted for patient care. Prompting methods that use diagnostic reasoning have the potential to mitigate the "black box" limitations of LLMs, bringing them one step closer to safe and effective use in medicine.
health care sciences & services, medical informatics
What problem does this paper attempt to address?
The paper asks whether large language models (LLMs) can be prompted to reason the way clinicians do, so that a diagnosis comes with a rationale a human can follow, without losing diagnostic accuracy. To test this, it compares the traditional chain-of-thought (CoT) prompting method with four types of prompts based on clinical reasoning strategies: differential diagnosis, intuitive reasoning, analytical reasoning, and Bayesian reasoning. For GPT-3.5, intuitive reasoning prompts performed best, while differential diagnosis and analytical reasoning prompts significantly reduced performance. For the more advanced GPT-4, all prompting methods performed similarly, and none gained a noticeable accuracy improvement from diagnostic reasoning in the way human doctors do. The study's significance is that, even though GPT-4 cannot use clinical reasoning to improve diagnostic accuracy as humans can, it can mimic clinicians' cognitive processes to generate explanations that are easy to understand. Such explanations give physicians a way to judge whether a response is likely correct, which is crucial for the transparency and trustworthiness of LLMs in medical settings. The approach helps mitigate the limitations of LLMs as "black box" systems, bringing them closer to safe and effective use in medical practice.
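The comparison described above can be pictured as one clinical case paired with five different reasoning instructions. The following is a minimal sketch of that setup, not the authors' code: the instruction wording, the names `REASONING_INSTRUCTIONS`, `build_prompt`, and `build_all_prompts`, and the prompt layout are all assumptions made for illustration, and the paper's actual prompts may differ substantially.

```python
# Hypothetical sketch: pair one case description with each of the five
# prompting strategies named in the summary. The instruction texts below
# are illustrative placeholders, NOT the prompts used in the paper.

REASONING_INSTRUCTIONS = {
    "chain_of_thought": "Let's think step by step.",
    "differential_diagnosis": (
        "List a differential diagnosis, then narrow it step by step "
        "to the single most likely diagnosis."
    ),
    "intuitive": (
        "Use pattern recognition on the symptoms, signs, and results "
        "to reach the diagnosis."
    ),
    "analytical": (
        "Reason from the underlying pathophysiology to the diagnosis."
    ),
    "bayesian": (
        "Start from a prior probability for each candidate diagnosis "
        "and update it with each new finding."
    ),
}


def build_prompt(case_text: str, strategy: str) -> str:
    """Return a single prompt string for one case and one strategy."""
    if strategy not in REASONING_INSTRUCTIONS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    instruction = REASONING_INSTRUCTIONS[strategy]
    return (
        f"{case_text.strip()}\n\n"
        f"Instruction: {instruction}\n"
        "Diagnosis and rationale:"
    )


def build_all_prompts(case_text: str) -> dict[str, str]:
    """Return one prompt per strategy so their outputs can be compared."""
    return {name: build_prompt(case_text, name) for name in REASONING_INSTRUCTIONS}
```

Sending each prompt to the same model and grading the returned diagnoses would give one accuracy figure per strategy, which is the kind of comparison the summary reports. The model call and the grading step are left out here because the source does not describe them.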