Targeted Visual Prompting for Medical Visual Question Answering

Sergio Tascon-Morales, Pablo Márquez-Neila, Raphael Sznitman

2024-08-06

Abstract: With growing interest in recent years, medical visual question answering (Med-VQA) has rapidly evolved, with multimodal large language models (MLLMs) emerging as an alternative to classical model architectures. Specifically, their ability to add visual information to the input of pre-trained LLMs brings new capabilities for image interpretation. However, simple visual errors cast doubt on the actual visual understanding abilities of these models. To address this, region-based questions have been proposed as a means to assess and enhance actual visual understanding through compositional evaluation. To combine these two perspectives, this paper introduces targeted visual prompting to equip MLLMs with region-based questioning capabilities. By presenting the model with both the isolated region and the region in its context in a customized visual prompt, we show the effectiveness of our method across multiple datasets while comparing it to several baseline models. Our code and data are available at https://github.com/sergiotasconmorales/locvqallm.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the limited visual understanding of models in the Medical Visual Question Answering (Med-VQA) task. Although Multimodal Large Language Models (MLLMs) perform well in Med-VQA, they make simple visual errors that cast doubt on how well they actually understand the images. To address this, the paper proposes a method called **Targeted Visual Prompting**, which lets the model answer questions about a specific image region by including both the region of interest and its surrounding context in the prompt. The main contributions of the paper are as follows:

1. **Introduction of the Targeted Visual Prompting method**: A customized prompt that combines local information (the isolated region) with global information (the region in its context) improves the model's understanding of local regions.
2. **Experimental validation**: Experiments were conducted on multiple datasets, including Diabetic Macular Edema (DME-VQA), Da Vinci Surgical Robot Images (RIS-VQA), and Cataract Surgery Video Frames (INSEGCAT-VQA), demonstrating the effectiveness of the proposed method.
3. **Performance improvement**: Compared to the baseline methods, the proposed method achieved higher accuracy and F1 scores on all tested datasets and is particularly effective on localized (region-based) questions.

In summary, this research uses targeted visual prompting to improve how multimodal large language models understand and interpret medical images, thereby improving accuracy on the Med-VQA task.
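The idea of showing the model both the isolated region and the region in its context can be illustrated with a minimal sketch. This is not the authors' implementation: the nested-list image representation, the bounding-box format, the border used to mark the region, the helper names (`crop_region`, `mark_region`, `build_targeted_prompt`), and the prompt wording are all assumptions made for illustration only.

```python
from typing import Any, Dict, List, Tuple

# Assumption: a grayscale image stored as rows of integer pixel values.
Image = List[List[int]]
# Assumption: a region given as a box (top, left, height, width) in pixels.
Box = Tuple[int, int, int, int]


def crop_region(image: Image, box: Box) -> Image:
    """Return the isolated region as a new image."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def mark_region(image: Image, box: Box, value: int = 255) -> Image:
    """Return a copy of the full image with the region's border drawn on it."""
    top, left, height, width = box
    marked = [row[:] for row in image]  # copy so the input stays untouched
    for col in range(left, left + width):
        marked[top][col] = value
        marked[top + height - 1][col] = value
    for row in range(top, top + height):
        marked[row][left] = value
        marked[row][left + width - 1] = value
    return marked


def build_targeted_prompt(image: Image, box: Box, question: str) -> List[Dict[str, Any]]:
    """Interleave text and images: region in context, isolated region, question."""
    return [
        {"type": "text", "text": "Image with the region outlined:"},
        {"type": "image", "data": mark_region(image, box)},
        {"type": "text", "text": "Isolated region:"},
        {"type": "image", "data": crop_region(image, box)},
        {"type": "text", "text": f"Question: {question}"},
    ]


if __name__ == "__main__":
    # Hypothetical 4x4 image and a 2x2 region, purely for demonstration.
    toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
    prompt = build_targeted_prompt(toy_image, (1, 1, 2, 2), "Is there an abnormality in this region?")
    for part in prompt:
        print(part["type"])
```

In a real pipeline the two image entries would be encoded by the model's visual encoder and interleaved with the text tokens; how that is done depends on the specific MLLM and is not shown here.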
