PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery

Runlong He, Mengya Xu, Adrito Das, Danyal Z. Khan, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam
2024-05-23
Abstract: Visual Question Answering (VQA) within the surgical domain, utilizing Large Language Models (LLMs), offers a distinct opportunity to improve intra-operative decision-making and facilitate intuitive surgeon-AI interaction. However, the development of LLMs for surgical VQA is hindered by the scarcity of diverse and extensive datasets with complex reasoning tasks. Moreover, contextual fusion of the image and text modalities remains an open research challenge due to the inherent differences between these two types of information and the complexity involved in aligning them. This paper introduces PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary surgery, and PitVQA-Net, an adaptation of GPT2 with a novel image-grounded text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical aspects such as phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions. PitVQA-Net consists of a novel image-grounded text embedding that projects image and text features into a shared embedding space, and a GPT2 backbone with an excitation block classification head to generate contextually relevant answers within the complex domain of endonasal pituitary surgery. Our image-grounded text embedding leverages joint embedding, cross-attention and contextual representation to understand the contextual relationship between questions and surgical images. We demonstrate the effectiveness of PitVQA-Net on both the PitVQA and the publicly available EndoVis18-VQA dataset, achieving improvements in balanced accuracy of 8% and 9% over the most recent baselines, respectively. Our code and dataset are available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the technical challenges of visual question answering (VQA) with large language models (LLMs) in endonasal pituitary surgery. Specifically, it aims to:

1. **Address the scarcity of datasets**: Existing surgical VQA datasets lack the diversity, scale, and complex reasoning tasks needed to develop LLMs. The paper therefore introduces PitVQA, a dataset designed specifically for VQA in endonasal pituitary surgery. It comprises 25 procedural videos and a collection of question-answer pairs covering phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions.
2. **Improve the fusion of image and text modalities**: Image and text carry inherently different kinds of information, and aligning them remains an open research challenge. The paper proposes an image-grounded text embedding that projects image and text features into a shared embedding space and uses joint embedding, cross-attention, and contextual representation to capture the relationship between the question and the surgical image.
3. **Improve VQA model performance**: PitVQA-Net adapts GPT2 by combining the image-grounded text embedding with a GPT2 backbone and an excitation block classification head. Compared with the most recent baselines, it improves balanced accuracy by 8% on the PitVQA dataset and by 9% on the publicly available EndoVis18-VQA dataset.

In conclusion, by contributing a new dataset and an improved model architecture, the paper aims to raise the performance of surgical VQA systems, supporting intra-operative decision-making and more intuitive surgeon-AI interaction.
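
The results above are reported as balanced accuracy. As a minimal sketch, assuming the standard definition (the mean of per-class recall), the example below shows how this metric differs from plain accuracy when answer classes are imbalanced. The function name `balanced_accuracy` and the toy labels are hypothetical illustrations, not the paper's evaluation code or data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall for each class, then an unweighted average across classes.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy answers: one frequent class and one rare class.
y_true = ["suction"] * 8 + ["drill"] * 2
# A degenerate model that always predicts the frequent class.
y_pred = ["suction"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Because every class contributes equally to the average, a model that ignores rare answers scores well on plain accuracy but poorly on balanced accuracy, which is presumably why the metric suits surgical VQA answer sets with uneven class frequencies.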