Abstract: When the results of the Goh et al study¹ were presented at a recent National Academies of Medicine meeting, the audience was amazed and concerned. The randomized clinical trial assessed diagnostic performance by generalist physicians, who were asked to provide diagnoses for 6 simulated cases using either conventional online resources alone or a large language model (LLM) (ChatGPT Plus [GPT-4]; OpenAI) in addition to standard resources. The study also evaluated the ability of the LLM to solve the cases alone. The authors developed a rubric for measuring diagnostic performance in which blinded experts evaluated participants' overall clinical reasoning process, including their proposed final diagnosis, their differential diagnosis, and factors supporting or opposing the diagnoses on their list. The study's principal finding was that physicians who had access to the LLM scored no better than the group who used only conventional resources. But the result that prompted consternation was the performance of the LLM alone, which scored significantly higher than either group of physicians. On hearing these results, more than one audience member wondered aloud, "Are we going to be out of a job?" The Goh et al trial¹ is an important advance in the study of generative artificial intelligence (AI) for diagnosis. By examining how clinicians use GPT-4 without specific training in use of the LLM, the study provides a realistic assessment of how clinicians use these tools in actual practice, both now and for the foreseeable future. Measuring the quality of the diagnostic process, rather than simply evaluating the accuracy of the final diagnosis, is a nuanced approach that provides a more accurate assessment of diagnostic reasoning. Future studies of diagnostic reasoning should use this method. The study demonstrates that access to generative AI alone will not improve diagnostic outcomes; clinicians will require training to use these resources to their full potential.
The authors appropriately caution that the results of this study "should not be interpreted to indicate that LLMs should be used for diagnosis autonomously without physician oversight,"¹ but the finding that the LLM outperformed physicians will almost certainly be the headline finding. Since making diagnoses is central to clinicians' professional identity, it is not surprising that the prospect of using LLMs for diagnosis evokes both excitement and trepidation. As more studies are published demonstrating the diagnostic capabilities of LLMs, what does this mean for clinicians? There are reasons to be skeptical that the performance of LLMs on simulated cases can generalize to the clinical practice setting. The study's cases were representative of common general practice diagnoses but were presented in an orderly fashion with the relevant history, physical examination, laboratory, and imaging results necessary to construct a prioritized differential diagnosis. Diagnosis in the clinical setting is an iterative and complicated process that takes place amid many competing demands and requires input from the patient, caregivers, and multiple clinicians in addition to objective data. Far from a linear process, diagnosis in the clinical practice setting involves progressively refining diagnoses based on new information, and the distinction between diagnosis and treatment is often blurred as clinicians incorporate treatment response into diagnostic reasoning. How do LLMs perform at diagnosis under conditions closer to actual clinical practice? A recent study² evaluated the performance of LLMs on diagnosing and developing management plans for 4 common abdominal conditions, using a dataset consisting of anonymized real patient data. Information was presented to the LLM in a stepwise manner, and after each step, the LLM was asked to summarize the information and provide a diagnosis or request additional testing.
Once the LLM provided a diagnosis, it was required to recommend a treatment plan. When confronted with this realistic clinical decision-making scenario, LLMs performed poorly: significantly worse than physicians for all but the simplest diagnoses. The LLMs also failed to consistently request appropriate diagnostic testing and frequently made incorrect treatment recommendations even after arriving at the correct diagnosis. Continued refinement of LLMs may eventually address these limitations, and this may happen quickly given the rapidity with which the performance of LLMs has improved. But even if LLMs prove to be capable of iterative diagnosis based on evolving information, will they reduce harm from diagnostic errors? Here again, some skepticism is warranted. As with all adverse events, missed and delayed diagnoses occur due to underlying system failures (latent errors) that allow errors made by individuals (active errors) to reach the patient and cause harm. The powe -Abstract Truncated-

The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant.

A Survey of Clinicians’ Views of the Utility of Large Language Models

Based on Medicine, The Now and Future of Large Language Models

Evaluation of large language models as a diagnostic aid for complex medical cases

On the limitations of large language models in clinical diagnosis

Critical Care Studies Using Large Language Models Based on Electronic Healthcare Records: A Technical Note

Evaluating the use of large language models to provide clinical recommendations in the Emergency Department

Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks

Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery

Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses

Large language model application in emergency medicine and critical care

Transformative potential of Large Language Models in data mining on Electronic Health Records.

Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

Potential applications and implications of large language models in primary care

Large Language Models Like ChatGPT Show Promise, but Clinical Use of Artificial Intelligence Requires Physician Partnership to Enable Patient Care, Minimize Administrative Burden, Maximize Efficiency, and Minimize Risk

Large Language Models—Misdiagnosing Diagnostic Excellence?

Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios

Systematic review: The use of large language models as medical chatbots in digestive diseases

The Pulse of Artificial Intelligence in Cardiology: A Comprehensive Evaluation of State-of-the-art Large Language Models for Potential Use in Clinical Cardiology

A systematic evaluation of the performance of GPT-4 and PaLM2 to diagnose comorbidities in MIMIC-IV patients

The future landscape of large language models in medicine