Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise

Tianyu Han, Lisa C Adams, Keno Bressem, Felix Busch, Luisa Huck, Sven Nebelung, Daniel Truhn
DOI: https://doi.org/10.1101/2023.11.03.23297957
2023-11-06
medRxiv
Abstract: Importance: Artificial intelligence will become an integral part of clinical medicine. Large language models (LLMs) are promising candidates, particularly given their multimodal abilities. These models need to be evaluated on real clinical cases. Objective: To test whether GPT-4V can consistently comprehend complex diagnostic scenarios. Design: A selection of 140 clinical cases from the JAMA Clinical Challenge and 348 from the NEJM Image Challenge was used. Each case, comprising a clinical image and a corresponding question, was processed by GPT-4V, and the responses were documented. The significance of imaging information was assessed by comparing GPT-4V's performance with that of four other leading-edge LLMs. Main Outcomes and Measures: Accuracy was measured by comparing each model's answers with the established ground truths of the challenges. The confidence interval for each model's performance was calculated using bootstrapping methods. Additionally, human performance on the NEJM Image Challenge was measured by the accuracy of challenge participants. Results: GPT-4V demonstrated superior accuracy in analyses of both text and images, achieving an accuracy of 73.3% for JAMA and 88.7% for NEJM, notably outperforming text-only LLMs such as GPT-4, GPT-3.5, Llama2, and Med-42. Remarkably, both GPT-4V and GPT-4 exceeded the average performance of human participants at all complexity levels within the NEJM Image Challenge. Conclusions and Relevance: GPT-4V has exhibited considerable promise in clinical diagnostic tasks, surpassing the capabilities of its predecessors as well as those of the human raters who participated in the challenge. Despite these encouraging results, such models should be adopted with prudence in clinical settings, augmenting rather than replacing human judgment.
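The abstract states that confidence intervals were obtained by bootstrapping but does not specify the variant. As a minimal sketch, assuming a percentile bootstrap over per-case correct/incorrect outcomes, the calculation could look like the following. The function name `bootstrap_accuracy_ci`, the resample count, the seed, and the example outcomes are all hypothetical and are not taken from the paper.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 outcomes, one per case (1 = model answer
    matched the ground truth). Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)

    # Resample the cases with replacement and record each resample's accuracy.
    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()

    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the resampled accuracies.
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))

    point = sum(correct) / n
    return point, resampled[lower_idx], resampled[upper_idx]


# Illustrative data only (NOT the paper's results): 80 correct out of 100 cases.
outcomes = [1] * 80 + [0] * 20
point, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Resampling whole cases keeps each question and its outcome together, so the interval reflects variability in which cases happened to be selected rather than any assumption about the distribution of accuracy.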