Abstract:

Background: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear.

Objective: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity.

Methods: We compared the performance of GPT-4 and physicians in February and March 2023, using 45 typical clinical vignettes, each with a correct diagnosis and triage level. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as "correct" or "incorrect." Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity by adding information on patient race and ethnicity to the clinical vignettes.

Results: The accuracy of diagnosis was comparable between GPT-4 and physicians: the percentage of correct diagnoses was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians (P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage did not differ significantly when patients' race and ethnicity information was added: it was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients, 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients, 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients, and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity.

Conclusions: GPT-4's ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage.
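The abstract does not say how its 95% CIs were computed. As a minimal sketch, assuming an exact binomial (Clopper-Pearson) interval, the following reproduces the reported 88.2%-99.9% for 44/45 correct diagnoses. The function names (`binom_cdf`, `solve_increasing`, `clopper_pearson`) are illustrative and not taken from the study; only the Python standard library is used.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(f, target: float, iters: int = 60) -> float:
    """Bisection on [0, 1]: find p with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) CI for a proportion of k successes in n trials.

    Assumption: the abstract's CIs are exact binomial intervals; the source
    does not name the method.
    """
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # i.e. P(X > k) equals 1 - alpha / 2 (increasing in p).
    upper = 1.0 if k == n else solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    # Counts taken from the abstract: correct diagnoses and appropriate triage.
    for label, k in [("GPT-4 diagnosis", 44), ("Physician diagnosis", 41),
                     ("Triage (GPT-4 and physicians)", 30)]:
        lo, hi = clopper_pearson(k, 45)
        print(f"{label}: {k}/45 = {k / 45:.1%}, 95% CI {lo:.1%}-{hi:.1%}")
```

Bisection on the binomial CDF is used here only to avoid a dependency on a statistics library. With 45 vignettes the intervals are wide, which is why accuracies of 97.8% and 91.1% are reported as comparable.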

Even with ChatGPT, race matters

ChatGPT Exhibits Gender and Racial Biases in Acute Coronary Syndrome Management

Performance of ChatGPT on the MCAT: The Road to Personalized and Equitable Premedical Learning

Unmasking and Quantifying Racial Bias of Large Language Models in Medical Report Generation

Coding Inequity: Assessing GPT-4's Potential for Perpetuating Racial and Gender Biases in Healthcare

Fairness in AI-Driven Oncology: Investigating Racial and Gender Biases in Large Language Models

Computer says 'no': Exploring systemic bias in ChatGPT using an audit approach

Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study

Uncovering Language Disparity of ChatGPT in Healthcare: Non-English Clinical Environment for Retinal Vascular Disease Classification (Preprint)

How GPT-3 responds to different publics on climate change and Black Lives Matter: A critical appraisal of equity in conversational AI

Large Language Models Portray Socially Subordinate Groups as More Homogeneous, Consistent with a Bias Observed in Humans

The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study

Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Artificial intelligence in global health equity: an evaluation and discussion on the application of ChatGPT, in the Chinese National Medical Licensing Examination

Toxicity in ChatGPT: Analyzing Persona-assigned Language Models

Stars, Stripes, and Silicon: Unravelling the ChatGPT's All-American, Monochrome, Cis-centric Bias

Write It Like You See It: Detectable Differences in Clinical Notes By Race Lead To Differential Model Recommendations

Quite Good, but Not Enough: Nationality Bias in Large Language Models -- A Case Study of ChatGPT

Disability Ethics and Education in the Age of Artificial Intelligence: Identifying Ability Bias in ChatGPT and Gemini

A vignette-based evaluation of ChatGPT's ability to provide appropriate and equitable medical advice across care contexts

Is ChatGPT More Biased Than You?