Localized Questions in Medical Visual Question Answering

Sergio Tascon-Morales, Pablo Márquez-Neila, Raphael Sznitman
2023-07-03
Abstract: Visual Question Answering (VQA) models aim to answer natural language questions about given images. Due to its ability to ask questions that differ from those used when training the model, medical VQA has received substantial attention in recent years. However, existing medical VQA models typically focus on answering questions that refer to an entire image rather than where the relevant content may be located in the image. Consequently, VQA models are limited in their interpretability power and the possibility to probe the model about specific image regions. This paper proposes a novel approach for medical VQA that addresses this limitation by developing a model that can answer questions about image regions while considering the context necessary to answer the questions. Our experimental results demonstrate the effectiveness of our proposed model, outperforming existing methods on three datasets. Our code and data are available at https://github.com/sergiotasconmorales/locvqa.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a limitation of existing medical Visual Question Answering (VQA) models: they typically answer questions about an entire image and cannot be asked about a specific region of it. This restricts both the interpretability of the model and the ability to probe it about particular image regions. The paper therefore proposes a model that answers questions about image regions while taking into account the context needed to answer them. Its main contributions are as follows:

1. **Proposing a new VQA architecture**: The architecture focuses on a specific region when answering a question while still considering the context of the entire image, which improves accuracy on localized questions.
2. **Introducing the multi-glance attention mechanism**: The model first considers the image as a whole and then restricts its attention to the region the question refers to, retaining the context of both the question and the region.
3. **Experimental verification**: The authors evaluate the model on three datasets and compare it with existing baseline methods. The results show that the proposed model outperforms these baselines.

Through these contributions, the paper addresses the limitation of existing medical VQA models on localized questions and offers a more flexible tool for medical image analysis.
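The idea behind the multi-glance attention described above, looking at the whole image first and then restricting attention to the queried region, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (their code is in the linked repository). The function name `region_restricted_attention`, the dot-product scoring, the feature shapes, and the concatenation of the two context vectors are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array (-inf entries get weight 0)."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def region_restricted_attention(features, question, region_mask):
    """Hypothetical two-step attention over image locations.

    features:    (N, D) array, one D-dim visual feature per spatial location.
    question:    (D,) array, an embedding of the question (assumed given).
    region_mask: (N,) boolean array, True for locations inside the region.
    Returns the fused (2*D,) context vector and the (N,) in-region weights.
    """
    if not region_mask.any():
        raise ValueError("region_mask must select at least one location")

    # Relevance of every location to the question (plain dot product here).
    scores = features @ question

    # Step 1: attend over the whole image to capture global context.
    global_weights = softmax(scores)
    global_context = global_weights @ features

    # Step 2: attend again, but only inside the queried region.
    # Locations outside the region get a score of -inf, hence weight 0.
    masked_scores = np.where(region_mask, scores, -np.inf)
    local_weights = softmax(masked_scores)
    local_context = local_weights @ features

    # Keep both views so a downstream classifier sees region and context.
    fused = np.concatenate([global_context, local_context])
    return fused, local_weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(16, 8))   # e.g. a 4x4 feature grid, flattened
    q = rng.normal(size=8)
    mask = np.zeros(16, dtype=bool)
    mask[5:9] = True                   # hypothetical region of interest
    fused, weights = region_restricted_attention(feats, q, mask)
    print(fused.shape, weights[mask].sum())
```

In this sketch the region enters only through the mask applied in the second step, so the first step still sees the full image. That mirrors the summary's point that context is retained while the answer is tied to the region.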