Abstract: Background: The use of large language models (LLMs) has recently gained popularity in diverse areas, including answering questions posted by patients as well as medical professionals. Objective: To evaluate the performance and limitations of LLMs in providing the correct diagnosis for a complex clinical case. Design: Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital Case Records, and differential diagnoses were generated by OpenAI’s GPT3.5 and GPT4 models. Results: The mean number of diagnoses provided by the Massachusetts General Hospital case discussants was 16.77, by GPT3.5 30, and by GPT4 15.45 (p < 0.0001). GPT4 more frequently listed the correct diagnosis first (22% versus 20% with GPT3.5, p = 0.86) and more frequently included the correct diagnosis among the top three generated diagnoses (42% versus 24%, p = 0.075). GPT4 was also better at providing the correct diagnosis when the diagnoses were grouped by medical specialty, and at including the correct diagnosis at any point in the differential list (68% versus 48%, p = 0.0063). GPT4 provided a differential list that was more similar to the list provided by the case discussants than GPT3.5 did (Jaccard Similarity Index 0.22 versus 0.12, p = 0.001). Inclusion of the correct diagnosis in the generated differential was correlated with the number of PubMed articles matching the diagnosis (OR 1.40, 95% CI 1.25–1.56 for GPT3.5; OR 1.25, 95% CI 1.13–1.40 for GPT4), but not with disease incidence. Conclusions and relevance: The GPT4 model generated a differential diagnosis list containing the correct diagnosis in approximately two thirds of cases, but the diagnosis ranked most likely was often incorrect for both models. In its current state, this tool can at most be used as an aid to expand on potential diagnostic considerations for a case, and future LLMs should be trained to account for the discrepancy between disease incidence and availability in the literature.
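
The Jaccard Similarity Index and the top-k inclusion measures reported above can be illustrated with a minimal Python sketch. It assumes that diagnoses are compared as exact strings after lowercasing and trimming; the abstract does not say how diagnoses were matched, and the function names and example lists below are hypothetical, not taken from the study.

```python
def normalize(diagnosis):
    """Lowercase and trim a diagnosis label (assumed matching rule, not the study's)."""
    return diagnosis.strip().lower()


def jaccard_index(list_a, list_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two diagnosis lists treated as sets."""
    a = {normalize(d) for d in list_a}
    b = {normalize(d) for d in list_b}
    if not a and not b:
        return 0.0  # convention chosen here for two empty lists
    return len(a & b) / len(a | b)


def correct_in_top_k(ranked_diagnoses, correct_diagnosis, k):
    """True if the correct diagnosis is among the first k entries of a ranked list."""
    top_k = [normalize(d) for d in ranked_diagnoses[:k]]
    return normalize(correct_diagnosis) in top_k


# Hypothetical lists for illustration only.
discussant_list = ["sarcoidosis", "tuberculosis", "lymphoma", "histoplasmosis"]
model_list = ["Tuberculosis", "Lymphoma", "Lung cancer"]

similarity = jaccard_index(discussant_list, model_list)  # 2 shared / 5 distinct
listed_first = correct_in_top_k(model_list, "lymphoma", k=1)
in_top_three = correct_in_top_k(model_list, "lymphoma", k=3)
```

In this toy example the two lists share two of five distinct diagnoses, so the index is 0.4. Real diagnosis lists would likely need synonym handling or manual adjudication rather than exact string matching.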

Multimodal Large Language Models are Generalist Medical Image Interpreters

Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports

Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis

An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging

Evaluating General Vision-Language Models for Clinical Medicine

Comparison of Multi-Modal Large Language Models with Deep Learning Models for Medical Image Classification

Visual-Textual Integration in LLMs for Medical Diagnosis: A Quantitative Analysis

Capability of multimodal large language models to interpret pediatric radiological images

Multimodal Foundation Models Exploit Text to Make Medical Image Predictions

Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations

Assessing large multimodal models for one-shot learning and interpretability in biomedical image classification

Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4

Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models

Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems

Multimodal Large Language Models for Bioimage Analysis

In-context learning enables multimodal large language models to classify cancer pathology images

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

Large language models (LLMs) in radiology exams for medical students: Performance and consequences

From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice

Evaluation of large language models as a diagnostic aid for complex medical cases