Targeted Visual Prompting for Medical Visual Question Answering

Sergio Tascon-Morales, Pablo Márquez-Neila, Raphael Sznitman
2024-08-06
Abstract: With growing interest in recent years, medical visual question answering (Med-VQA) has rapidly evolved, with multimodal large language models (MLLMs) emerging as an alternative to classical model architectures. Specifically, their ability to add visual information to the input of pre-trained LLMs brings new capabilities for image interpretation. However, simple visual errors cast doubt on the actual visual understanding abilities of these models. To address this, region-based questions have been proposed as a means to assess and enhance actual visual understanding through compositional evaluation. To combine these two perspectives, this paper introduces targeted visual prompting to equip MLLMs with region-based questioning capabilities. By presenting the model with both the isolated region and the region in its context in a customized visual prompt, we show the effectiveness of our method across multiple datasets while comparing it to several baseline models. Our code and data are available at https://github.com/sergiotasconmorales/locvqallm.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the question of how well models actually understand images in the Medical Visual Question Answering (Med-VQA) task. Although Multimodal Large Language Models (MLLMs) perform well in Med-VQA, they make simple visual errors that cast doubt on their visual understanding. To address this, the paper proposes a method called **Targeted Visual Prompting**, which lets the model answer questions about a specific image region by including both the region of interest and its surrounding context in the prompt.

The main contributions of the paper are as follows:

1. **Introduction of the Targeted Visual Prompting method**: A customized visual prompt combines the isolated region (local information) with the region shown in its context (global information), strengthening the model's understanding of local regions.
2. **Experimental validation**: Experiments were conducted on multiple datasets, including Diabetic Macular Edema (DME-VQA), Da Vinci Surgical Robot Images (RIS-VQA), and Cataract Surgery Video Frames (INSEGCAT-VQA), demonstrating the effectiveness of the proposed method.
3. **Performance improvement**: Compared to the baseline methods, the proposed method achieved higher accuracy and F1 scores on all tested datasets, particularly on questions about localized regions.

In summary, the research uses targeted visual prompting to improve how multimodal large language models understand and interpret medical images, thereby improving accuracy on the Med-VQA task.
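The idea of showing the model both the isolated region and the same region in its context can be illustrated with a small sketch. This is not the authors' implementation (their code is in the linked repository). The function names, the `(top, left, height, width)` box convention, the choice to mark the region with an outline, and the interleaved `("text", ...)` / `("image", ...)` prompt layout are all assumptions made for illustration, and images are represented as plain nested lists of grayscale values to keep the example dependency-free.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, height, width): hypothetical convention


def crop_region(image: Image, box: Box) -> Image:
    """Return the isolated region as a new image."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def outline_region(image: Image, box: Box, value: int = 255) -> Image:
    """Return a copy of the full image with the box border drawn on it."""
    top, left, height, width = box
    marked = [row[:] for row in image]  # copy rows so the input is not modified
    bottom, right = top + height - 1, left + width - 1
    for c in range(left, right + 1):    # top and bottom edges
        marked[top][c] = value
        marked[bottom][c] = value
    for r in range(top, bottom + 1):    # left and right edges
        marked[r][left] = value
        marked[r][right] = value
    return marked


def build_targeted_prompt(image: Image, box: Box, question: str) -> list:
    """Assemble an interleaved prompt: context view, isolated region, question.

    The segment order and wording are illustrative assumptions, not the
    prompt template used in the paper.
    """
    return [
        ("text", "Full image with the region outlined:"),
        ("image", outline_region(image, box)),
        ("text", "Isolated region:"),
        ("image", crop_region(image, box)),
        ("text", question),
    ]


if __name__ == "__main__":
    toy = [[r * 5 + c for c in range(5)] for r in range(5)]
    for kind, content in build_targeted_prompt(toy, (1, 1, 3, 3), "Is there an abnormality in this region?"):
        print(kind, content)
```

In a real pipeline, the two image segments would be encoded by the MLLM's vision encoder and interleaved with the text tokens. An image library such as Pillow would typically handle the cropping and drawing instead of nested lists.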