Visual-Textual Integration in LLMs for Medical Diagnosis: A Quantitative Analysis

Reem Agbareia, Mahmud Omar Sr., Shelly Soffer Sr., Benjamin S Glicksberg, Girish Nadkarni, Eyal Klang
DOI: https://doi.org/10.1101/2024.08.31.24312878
2024-09-03
Abstract:

Background and Aim: Visual data from images is essential for many medical diagnoses. This study evaluates the performance of multimodal Large Language Models (LLMs) in integrating textual and visual information for diagnostic purposes.

Methods: We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes with and without accompanying images. Each vignette included patient demographics, a chief complaint, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM challenges, ensuring they were not in the LLMs' training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and the models' explanations for a subset of cases.

Results: LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8%, Claude Sonnet 3.5: 59.5%, Physicians: 39.5%). With image integration, all improved, but physicians showed the largest gain (GPT-4o: 84.5%, p<0.001; Claude Sonnet 3.5: 67.3%, p=0.060; Physicians: 78.8%, p<0.001). LLMs changed their explanations in 45-60% of cases when presented with images, demonstrating some level of visual data integration.

Conclusion: Multimodal LLMs show promise in medical diagnosis, with improved performance when integrating visual evidence. However, this improvement is inconsistent and smaller compared to physicians, indicating a need for enhanced visual data processing in these models.

Keywords: Artificial Intelligence, Medical Diagnosis, Multimodal Learning, Large Language Models, Visual Data Integration
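The statement in the Results that physicians showed the largest gain can be checked by taking the difference, in percentage points, between the text-only and text-plus-image accuracies. The following is a minimal sketch that uses only the accuracy figures quoted in the abstract. The names `reported` and `gain_points` are illustrative and do not come from the paper, and the sketch does not reproduce the authors' significance testing.

```python
# Accuracy (%) as quoted in the abstract: (text-only, text + image).
reported = {
    "GPT-4o": (70.8, 84.5),
    "Claude Sonnet 3.5": (59.5, 67.3),
    "Physicians": (39.5, 78.8),
}


def gain_points(text_only: float, with_image: float) -> float:
    """Return the accuracy gain in percentage points, rounded to one decimal."""
    return round(with_image - text_only, 1)


# Gain per group, and the group with the largest gain.
gains = {name: gain_points(*acc) for name, acc in reported.items()}
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
print("Largest gain:", largest)
```

Under these figures the gains come to roughly 13.7 points for GPT-4o, 7.8 points for Claude Sonnet 3.5, and 39.3 points for physicians, which is consistent with the abstract's claim.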
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Evaluating the performance of multimodal large language models (LLMs) in medical diagnosis**: Specifically, it examines how these models integrate textual and visual information for diagnosis. The paper tests the diagnostic accuracy of two multimodal LLMs, GPT-4o and Claude Sonnet 3.5, in clinical cases with and without images.
2. **Comparing the performance of multimodal LLMs with human doctors**: By comparing the diagnostic accuracy of the models and human doctors in text-only and combined text-image scenarios, the study assesses the practical application value of LLMs in medical diagnosis.
3. **Analyzing changes in model explanations**: The study investigates whether the diagnostic logic and explanations of the models change when they receive image information and explores whether such changes indicate effective utilization of visual data by the models.

In summary, the paper primarily focuses on the performance improvement of multimodal large language models in medical diagnosis and their advantages and disadvantages compared to human doctors.