Abstract: The modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While surgical data science has largely focused on video analysis, integrating surgical computer vision with language capabilities is emerging as a necessity. Our work aims to advance Visual Question Answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in current surgical VQA systems: removing question-condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. First, we propose a Surgical Scene Graph-based dataset, SSG-QA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, which generates diverse QA pairs. Compared to existing surgical VQA datasets, SSG-QA is more complex, diverse, geometrically grounded, unbiased, and surgical action-oriented. We then propose SSG-QA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module (SIM), which integrates geometric scene knowledge into the VQA model design by employing cross-attention between the textual and the scene features. Our comprehensive analysis on the SSG-QA dataset shows that SSG-QA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation of current surgical VQA systems is the lack of scene knowledge needed to answer complex queries. We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design.
The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-QA
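
To make the cross-attention idea mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not the SIM module itself: the abstract states only that cross-attention is applied between textual and scene features, so the choice of text tokens as queries and scene features as keys and values, the absence of learned projections, and the names `softmax` and `cross_attention` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_feats, scene_feats):
    """Scaled dot-product cross-attention (hypothetical helper).

    Assumption: each text token acts as a query, and each scene feature
    acts as both key and value. No learned projection matrices are used.
    Both inputs are lists of equal-length float vectors.
    """
    d = len(scene_feats[0])
    outputs = []
    for q in text_feats:
        # Similarity of this text token to every scene feature.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in scene_feats
        ]
        weights = softmax(scores)
        # Weighted sum of scene features: the scene-aware text representation.
        attended = [
            sum(w * v[j] for w, v in zip(weights, scene_feats))
            for j in range(d)
        ]
        outputs.append(attended)
    return outputs


# Toy inputs: two text tokens and two scene features, all 2-dimensional.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
scene_feats = [[10.0, 0.0], [0.0, 10.0]]
fused = cross_attention(text_feats, scene_feats)
```

In this toy case each text token attends almost entirely to the scene feature it aligns with, so the output has one scene-informed vector per text token. A real model would add learned query, key, and value projections and multiple heads.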
