Advancing surgical VQA with scene graph knowledge

Kun Yuan, Manasi Kattel, Joël L. Lavanchy, Nassir Navab, Vinkle Srivastav, Nicolas Padoy
DOI: https://doi.org/10.1007/s11548-024-03141-y
2024-05-25
International Journal of Computer Assisted Radiology and Surgery
Abstract: The modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with natural language capabilities is emerging as a necessity. Our work aims to advance visual question answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in the current surgical VQA systems: removing question–condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design.
engineering, biomedical; radiology, nuclear medicine & medical imaging; surgery
What problem does this paper attempt to address?
This paper addresses two main challenges in surgical visual question answering (surgical VQA): removing question-condition bias from surgical VQA datasets, and incorporating scene-aware reasoning into surgical VQA model design. Current surgical VQA systems often overlook detailed scene knowledge, which limits their ability to answer complex queries. The paper therefore proposes a new surgical scene graph-based VQA dataset (SSG-VQA) and a new surgical VQA model (SSG-VQA-Net), aiming to significantly improve surgical VQA performance by introducing geometric scene features.

First, the paper constructs the SSG-VQA dataset from surgical scene graphs. The dataset is generated using segmentation and detection models, and spatial and action information is used to build the surgical scene graphs. These scene graphs are fed into a question engine to generate diverse question-answer pairs.

Second, the paper proposes SSG-VQA-Net, a new surgical VQA model that contains a lightweight Scene-embedded Interaction Module (SIM). By applying a cross-attention mechanism between text and scene features, SIM integrates geometric scene knowledge into the VQA model design.

Experimental results show that, compared with existing surgical VQA datasets, SSG-VQA is more complex and diverse, is geometrically grounded, and is oriented towards surgical actions. SSG-VQA-Net outperforms existing methods across question types and complexity levels, especially on complex questions that require visual reasoning. The paper also points out that the main bottleneck of current surgical VQA models lies in learning the encoded representation rather than in decoding the sequence.
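The idea of feeding scene graphs into a question engine can be illustrated with a toy sketch. This is not the paper's actual engine: the graph layout, the object and relation names, the question templates, and the function `generate_qa_pairs` are all hypothetical and chosen only to show how question-answer pairs might be derived from nodes and edges.

```python
from collections import Counter

# Hypothetical toy scene graph. Nodes are detected objects; edges are
# (subject, relation, object) triplets. All labels are illustrative only.
scene_graph = {
    "nodes": ["grasper", "hook", "gallbladder", "liver"],
    "edges": [
        ("grasper", "retract", "gallbladder"),
        ("hook", "dissect", "gallbladder"),
        ("gallbladder", "below", "liver"),
    ],
}


def generate_qa_pairs(graph):
    """Fill simple (assumed) question templates from a scene graph."""
    qa_pairs = []

    # Existence questions: one per node.
    for node in graph["nodes"]:
        qa_pairs.append((f"Is there a {node} in the image?", "yes"))

    # Relation questions: one per edge.
    for subj, rel, obj in graph["edges"]:
        question = f"What is the relation between the {subj} and the {obj}?"
        qa_pairs.append((question, rel))

    # Counting questions: number of edges whose target is a given object.
    incoming = Counter(obj for _, _, obj in graph["edges"])
    for obj, count in incoming.items():
        qa_pairs.append((f"How many relations point to the {obj}?", str(count)))

    return qa_pairs


qa_pairs = generate_qa_pairs(scene_graph)
for question, answer in qa_pairs:
    print(f"{question} -> {answer}")
```

Because every answer is read directly from the graph, questions of this kind are grounded in the scene content rather than in the wording of the question alone.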
Finally, the paper provides a diagnostic benchmark for testing a model's scene understanding and reasoning abilities, and makes the source code and dataset publicly available.
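The cross-attention between text and scene features mentioned in the summary can be sketched in a few lines. This is a minimal, dependency-free illustration of generic scaled dot-product cross-attention, not the SIM architecture itself: learned projection matrices are omitted, and the function names and the tiny 2-dimensional feature vectors are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, scene_tokens):
    """Let each text token (query) attend over all scene tokens.

    Simplification: the raw vectors act as queries, keys, and values,
    with no learned projections.
    """
    dim = len(scene_tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for query in text_tokens:
        # Scaled dot-product score between this text token and each scene token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in scene_tokens
        ]
        weights = softmax(scores)
        # Attention output: weighted sum of the scene feature vectors.
        fused.append([
            sum(w * value[i] for w, value in zip(weights, scene_tokens))
            for i in range(dim)
        ])
    return fused


# Toy inputs (assumed): two text tokens and three scene-object features.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
scene_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused = cross_attention(text_tokens, scene_tokens)
print(fused)
```

Each output row is a scene-informed representation of one text token, weighted towards the scene features most similar to that token.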