PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery

Runlong He, Mengya Xu, Adrito Das, Danyal Z. Khan, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam

2024-05-23

Abstract: Visual Question Answering (VQA) within the surgical domain, utilizing Large Language Models (LLMs), offers a distinct opportunity to improve intra-operative decision-making and facilitate intuitive surgeon-AI interaction. However, the development of LLMs for surgical VQA is hindered by the scarcity of diverse and extensive datasets with complex reasoning tasks. Moreover, contextual fusion of the image and text modalities remains an open research challenge due to the inherent differences between these two types of information and the complexity involved in aligning them. This paper introduces PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary surgery, and PitVQA-Net, an adaptation of GPT2 with a novel image-grounded text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical aspects such as phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions. PitVQA-Net consists of a novel image-grounded text embedding that projects image and text features into a shared embedding space, and a GPT2 backbone with an excitation block classification head to generate contextually relevant answers within the complex domain of endonasal pituitary surgery. Our image-grounded text embedding leverages joint embedding, cross-attention and contextual representation to understand the contextual relationship between questions and surgical images. We demonstrate the effectiveness of PitVQA-Net on both the PitVQA and the publicly available EndoVis18-VQA datasets, achieving improvements in balanced accuracy of 8% and 9% over the most recent baselines, respectively. Our code and dataset are available at
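The cross-attention step mentioned in the abstract can be illustrated with a minimal sketch. This is generic scaled dot-product cross-attention in plain Python, not the authors' implementation. Several choices here are assumptions that the abstract does not specify: question tokens act as queries, image patches act as keys and values, there are no learned projection matrices, and fusion is a simple residual sum. The function names and toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, image_patches):
    """Let each text token (query) attend over image patches (keys and values).

    Assumption: no learned projections; queries, keys and values are the raw
    feature vectors, all sharing the same dimension d.
    """
    d = len(text_tokens[0])
    attended = []
    for q in text_tokens:
        # Scaled dot-product score between this token and every patch.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in image_patches
        ]
        weights = softmax(scores)
        # Weighted average of patch features: the image context for this token.
        context = [
            sum(w * v[j] for w, v in zip(weights, image_patches))
            for j in range(d)
        ]
        attended.append(context)
    return attended


def image_grounded_text_embedding(text_tokens, image_patches):
    """Residual fusion (an assumption): text embedding plus its image context."""
    contexts = cross_attention(text_tokens, image_patches)
    return [
        [t + c for t, c in zip(tok, ctx)]
        for tok, ctx in zip(text_tokens, contexts)
    ]


# Toy input: 2 question tokens and 3 image patches of dimension 4 (hypothetical).
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
patches = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
fused = image_grounded_text_embedding(text, patches)
```

In this toy case the first question token is most similar to the first patch, so its fused embedding receives the largest contribution from that patch. A real model would add learned query, key and value projections and feed the fused sequence to the language-model backbone.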

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the technical challenges of visual question answering (VQA) with large language models (LLMs) in endonasal pituitary surgery. Specifically, it aims to:

1. **Address the scarcity of datasets**: Existing surgical VQA datasets lack diversity, scale, and complex reasoning tasks. The paper therefore introduces PitVQA, a dataset designed specifically for VQA in endonasal pituitary surgery. It comprises 25 procedural videos and a rich set of question-answer pairs covering phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions.
2. **Improve the fusion of image and text modalities**: Image and text are inherently different kinds of information, and aligning them effectively remains an open research challenge. The paper proposes an image-grounded text embedding that projects image and text features into a shared embedding space and uses joint embedding, cross-attention, and contextual representation to capture the relationship between the question and the surgical image.
3. **Improve VQA model performance**: Building on GPT2, the paper develops PitVQA-Net, which combines the image-grounded text embedding with a GPT2 backbone and an excitation block classification head. PitVQA-Net improves balanced accuracy over the most recent baselines by 8% on the PitVQA dataset and by 9% on the publicly available EndoVis18-VQA dataset.

In conclusion, by contributing a new dataset and an improved model architecture, the paper aims to improve surgical VQA and thereby support intra-operative decision-making and more intuitive surgeon-AI interaction.
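The reported gains are in balanced accuracy. As a minimal sketch, assuming the standard definition (the mean of per-class recall), the metric can be computed as below. The page does not state the paper's exact evaluation code, and the class labels and predictions here are hypothetical.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced answers: 8 "forceps" and 2 "suction" ground truths,
# scored against a model that always answers "forceps".
y_true = ["forceps"] * 8 + ["suction"] * 2
y_pred = ["forceps"] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

Here plain accuracy is 0.8, while balanced accuracy is 0.5 because recall on the minority class is zero. This is why balanced accuracy is a more informative metric when answer classes are unevenly distributed.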

