Abstract:Summary Background Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labelled data, making deployment and generalisability challenging. How well a general-purpose AI language model performs diagnosis and triage relative to physicians and laypeople is not well understood. Methods We compared the predictive accuracy of Generative Pre-trained Transformer 3 (GPT-3)'s diagnostic and triage ability for 48 validated synthetic case vignettes (<50 words; sixth-grade reading level or below) of both common (eg, viral illness) and severe (eg, heart attack) conditions to a nationally representative sample of 5000 lay people from the USA who could use the internet to find the correct options and 21 practising physicians at Harvard Medical School. There were 12 vignettes for each of four triage categories: emergent, within one day, within 1 week, and self-care. The correct diagnosis and triage category (ie, ground truth) for each vignette was determined by two general internists at Harvard Medical School. For each vignette, human respondents and GPT-3 were prompted to list diagnoses in order of likelihood, and the vignette was marked as correct if the ground-truth diagnosis was in the top three of the listed diagnoses. For triage accuracy, we examined whether the human respondents' and GPT-3's selected triage was exactly correct according to the four triage categories, or matched a dichotomised triage variable (emergent or within 1 day vs within 1 week or self-care). We estimated GPT-3's diagnostic and triage confidence on a given vignette using a modified bootstrap resampling procedure, and examined how well calibrated GPT-3's confidence was by computing calibration curves and Brier scores. We also performed subgroup analysis by case acuity, and an error analysis for triage advice to characterise how its advice might affect patients using this tool to decide if they should seek medical care immediately. Findings Among all cases, GPT-3 replied with the correct diagnosis in its top three for 88% (42/48, 95% CI 75–94) of cases, compared with 54% (2700/5000, 53–55) for lay individuals (p<0.0001) and 96% (637/666, 94–97) for physicians (p=0·012). GPT-3 triaged 70% correct (34/48, 57–82) versus 74% (3706/5000, 73–75; p=0.60) for lay individuals and 91% (608/666, 89–93%; p<0.0001) for physicians. As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well calibrated for diagnosis (Brier score=0·18) and triage (Brier score=0·22). We observed an inverse relationship between case acuity and GPT-3 accuracy (p<0·0001) with a fitted trend line of –8·33% decrease in accuracy for every level of increase in case acuity. For triage error analysis, GPT-3 deprioritised truly emergent cases in seven instances. Interpretation A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below, physicians and better than lay individuals. We found that GPT-3's performance was inferior to physicians for triage, sometimes by a large margin, and its performance was closer to that of lay individuals. Although the diagnostic performance of GPT-3 was comparable to physicians, it was significantly better than a typical person using a search engine. Funding The National Heart, Lung, and Blood Institute.

Performance of GPT-4 and GPT-3.5 in generating accurate and comprehensive diagnoses across medical subspecialties

Performance Evaluation of the Generative Pre-trained Transformer (GPT-4) on the Family Medicine In-Training Examination

Performance of Progressive Generations of GPT on an Exam Designed for Certifying Physicians as Certified Clinical Densitometrists

Comprehensive analysis of the performance of GPT-3.5 and GPT-4 on the American Urological Association self-assessment study program exams from 2012-2023

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study

Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis

Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases

The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis

Evaluating ChatGPT-4's Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases

Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4

Comparative analysis of GPT-3.5 and GPT-4.0 in Taiwan's medical technologist certification: A study in artificial intelligence advancements

Assessing GPT-4 multimodal performance in radiological image analysis

The Case Records of ChatGPT: Language Models and Complex Clinical Questions

Capabilities of GPT-4 on Medical Challenge Problems

Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts

Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis

Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations

Evaluating GPT-4o's Performance in the Official European Board of Radiology Exam: A Comprehensive Assessment

Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports