Abstract: Background: The potential of large language models (LLMs) such as GPT to support complex tasks such as differential diagnosis has been a subject of debate, with some ascribing near-sentient abilities to the models and others claiming that LLMs merely perform "autocomplete on steroids". A recent study reported that the Generative Pretrained Transformer 4 (GPT-4) model performed well in complex differential diagnostic reasoning. The authors assessed the performance of GPT-4 in identifying the correct diagnosis in a series of case records from the New England Journal of Medicine. The authors constructed prompts based on the clinical presentation section of the case reports, and compared the results of GPT-4 to the actual diagnosis. GPT-4 returned the correct diagnosis as a part of its response in 64% of cases, with the correct diagnosis being at rank 1 in 39% of cases. However, such concise but comprehensive narratives of the clinical course are not typically available in electronic health records (EHRs). Further, even if they were available, EHRs contain identifying information whose transmission is prohibited by Health Insurance Portability and Accountability Act (HIPAA) regulations. Methods: To assess the expected performance of GPT on comparable datasets that can be generated by text mining and that by design cannot contain identifiable information, we parsed the texts of the case reports and extracted Human Phenotype Ontology (HPO) terms. From these terms we constructed prompts for GPT that contain largely the same clinical abnormalities but lack the surrounding narrative.
Results: While the performance of GPT-4 on the original narrative-based text was good, with the final diagnosis being included in its differential in 29/75 cases (38.7%; rank 1 in 17.3% of cases; mean rank of 3.4), the performance of GPT-4 on the feature-based approach, which includes the major clinical abnormalities without additional narrative text, was substantially worse, with GPT-4 including the final diagnosis in its differential in 8/75 cases (10.7%; rank 1 in 4.0% of cases; mean rank of 3.9). Interpretation: We consider the feature-based queries to be a more appropriate test of the performance of GPT-4 in diagnostic tasks, since it is unlikely that the narrative approach can be used in actual clinical practice. Future research and algorithmic development are needed to determine the optimal approach to leveraging LLMs for clinical diagnosis.
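
As a rough illustration of the feature-based approach and of how the reported percentages follow from per-case ranks, the minimal Python sketch below may help. It is not the authors' pipeline: the function names `build_feature_prompt` and `summarize_ranks`, the prompt wording, and the example HPO terms are all assumptions made for illustration, and the sketch assumes that the mean rank is taken only over cases in which the correct diagnosis appeared in the differential.

```python
from statistics import mean


def build_feature_prompt(hpo_terms):
    """Build a narrative-free prompt from (hpo_id, label) pairs.

    The wording is hypothetical; the abstract does not give the actual prompt.
    """
    features = "; ".join(f"{label} ({hpo_id})" for hpo_id, label in hpo_terms)
    return (
        "A patient presents with the following clinical abnormalities: "
        f"{features}. List the most likely diagnoses in ranked order."
    )


def summarize_ranks(ranks):
    """Summarize per-case results.

    ranks: one entry per case, holding the 1-based rank of the correct
    diagnosis in the returned differential, or None if it was absent.
    """
    hits = [r for r in ranks if r is not None]
    n = len(ranks)
    return {
        "included": len(hits),
        "included_pct": round(100 * len(hits) / n, 1),
        "rank1_pct": round(100 * sum(r == 1 for r in hits) / n, 1),
        # Assumption: mean rank is computed over included cases only.
        "mean_rank": round(mean(hits), 1) if hits else None,
    }


# Illustrative HPO terms (not taken from any case report in the study).
prompt = build_feature_prompt([("HP:0001250", "Seizure"), ("HP:0001945", "Fever")])

# Toy rank list shaped like the narrative-based result: 75 cases, 29 included.
# The 13 rank-1 cases are inferred from the reported 17.3% of 75; the
# remaining ranks (5) are invented, so the mean rank here is not the
# reported 3.4.
narrative_like = [1] * 13 + [5] * 16 + [None] * 46
summary = summarize_ranks(narrative_like)
```

With this toy input, `summary["included_pct"]` and `summary["rank1_pct"]` come out as the 38.7% and 17.3% stated in the Results, which shows that both percentages are taken over all 75 cases rather than over the included cases only.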

The Clinical Utility of Large Language Models in Diagnosing Neurocognitive Disorders among NACC Participants

ChatGPT Assisting Diagnosis of Neuro-ophthalmology Diseases Based on Case Reports

Uncovering Language Disparity of ChatGPT in Healthcare: Non-English Clinical Environment for Retinal Vascular Disease Classification (Preprint)

Evaluating Large Language Models in Extracting Cognitive Exam Dates and Scores

Diagnostic accuracy of large language models in psychiatry

The Case Records of ChatGPT: Language Models and Complex Clinical Questions

The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant.

On the limitations of large language models in clinical diagnosis

Are Different Versions of ChatGPT's Ability Comparable to the Clinical Diagnosis Presented in Case Reports? A Descriptive Study

A Survey of Clinicians’ Views of the Utility of Large Language Models

Exploiting ChatGPT for Diagnosing Autism-Associated Language Disorders and Identifying Distinct Features

Evaluating cognitive performance: Traditional methods vs. ChatGPT

Text dialogue analysis Based ChatGPT for Primary Screening of Mild Cognitive Impairment

Applications of large language models in psychiatry: a systematic review

Text Dialogue Analysis for Primary Screening of Mild Cognitive Impairment: Development and Validation Study

Enhancing Diagnostic Support for Chiari Malformation and Syringomyelia: A Comparative Study of Contextualized ChatGPT Models

Uncovering Language Disparity of ChatGPT on Retinal Vascular Disease Classification: Cross-Sectional Study

GPT-4 and Neurologists in Screening for Mild Cognitive Impairment in the Elderly: A Comparative Analysis Study

ChatGPT's Inconsistency in the Diagnosis of Alzheimer's Disease

Evaluation of large language models as a diagnostic aid for complex medical cases