Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

Zhiling Yan, Kai Zhang, Rong Zhou, Lifang He, Xiang Li, Lichao Sun
2023-10-30
Abstract: In this paper, we critically evaluate the capabilities of the state-of-the-art multimodal large language model, i.e., GPT-4 with Vision (GPT-4V), on the Visual Question Answering (VQA) task. Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images using both pathology and radiology datasets from 11 modalities (e.g., Microscopy, Dermoscopy, X-ray, CT) and fifteen objects of interest (brain, liver, lung, etc.). Our datasets encompass a comprehensive range of medical inquiries, including sixteen distinct question types. Throughout our evaluations, we devised textual prompts for GPT-4V, directing it to synergize visual and textual information. Our accuracy-based experiments conclude that the current version of GPT-4V is not recommended for real-world diagnostics due to its unreliable and suboptimal accuracy in responding to diagnostic medical questions. In addition, we delineate seven unique facets of GPT-4V's behavior in medical VQA, highlighting its constraints within this complex arena. The complete details of our evaluation cases are accessible at https://github.com/ZhilingYan/GPT4V-Medical-Report.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper evaluates the performance of the state-of-the-art multimodal large language model GPT-4 with Vision (GPT-4V) on medical visual question answering (VQA). Specifically, the researchers assessed GPT-4V using pathology and radiology datasets covering 11 modalities (such as microscopy, dermoscopy, X-ray, and CT) and 15 objects of interest (such as brain, liver, and lung). They also designed text prompts that guide GPT-4V to integrate visual and textual information when answering questions. The experimental results indicate that the current version of GPT-4V lacks accuracy and reliability on diagnostic medical questions and is not recommended for real-world medical diagnosis.

#### Main Contributions:

1. **Comprehensive Evaluation**: A detailed assessment of GPT-4V's performance on various medical images and clinical objects, covering 16 distinct question types.
2. **Performance Analysis**: Through rigorous testing, it was concluded that the current version of GPT-4V is unreliable and inaccurate on diagnostic medical questions.
3. **Behavioral Characteristics**: A detailed description of seven unique aspects of GPT-4V's behavior in medical VQA, revealing its limitations and adaptability.

#### Experimental Setup and Results:

- **Data Collection**: Samples were selected from multiple datasets such as PathVQA, VQA-RAD, and PMC-VQA to ensure diversity.
- **Evaluation Criteria**: Detailed evaluation criteria were established, including directly answering questions, providing correct answers, and avoiding vague expressions.
- **Accuracy Analysis**: The overall accuracy was 29.9% for pathology VQA tasks and 50% for radiology VQA tasks. In particular, accuracy was lower for localization and size-related questions, indicating that GPT-4V struggles in these areas.
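
The accuracy figures above are per-task fractions of responses judged correct. As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes each response has already been judged by hand against the three listed criteria (direct answer, correct answer, no vague wording) and that a response counts as correct only when all three hold. That combination rule, the function name `accuracy_by_group`, the record fields, and the toy records are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of records judged correct} for the given key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        # Assumption: a response is correct only if it answers directly,
        # matches the reference answer, and is not vague.
        if rec["direct"] and rec["matches_reference"] and not rec["vague"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Toy, made-up judgments for illustration only (not data from the paper).
toy_records = [
    {"task": "pathology", "direct": True, "matches_reference": True, "vague": False},
    {"task": "pathology", "direct": True, "matches_reference": False, "vague": False},
    {"task": "radiology", "direct": True, "matches_reference": True, "vague": False},
    {"task": "radiology", "direct": False, "matches_reference": True, "vague": True},
]

scores = accuracy_by_group(toy_records, "task")
for task, acc in sorted(scores.items()):
    print(f"{task}: {acc:.1%}")
```

Passing a different key, such as a hypothetical `"question_type"` field, would give a per-question-type breakdown of the kind used to compare localization and size-related questions against the rest.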