RoSIS: Robust Framework for Text-Promptable Surgical Instrument Segmentation Using Vision-Language Fusion

Tae-Min Choi, Juyoun Park
2024-11-19
Abstract: Surgical instrument segmentation (SIS) is an essential task in computer-assisted surgeries, with deep learning-based research improving accuracy in complex environments. Recently, text-promptable segmentation methods have been introduced to generate masks based on text prompts describing target objects. However, these methods assume that the object described by a given text prompt exists in the scene. As a result, a mask is generated whenever a related text prompt is provided, even if the object is absent from the image. Existing methods handle this by using prompts only for objects known to be present in the image, which introduces information that is inaccessible in a vision-based method setting and results in unfair comparisons. For fair comparison, we redefine the existing text-promptable SIS setting under robust conditions as Robust text-promptable SIS (R-SIS), in which prompts for all classes are forwarded and the model must determine whether the object described by each prompt exists. Furthermore, we propose a novel framework, Robust Surgical Instrument Segmentation (RoSIS), which combines visual and language features for promptable segmentation in the R-SIS setting. RoSIS employs an encoder-decoder architecture with a Multi-Modal Fusion Block (MMFB) and a Selective Gate Block (SGB) to achieve balanced integration of vision and language features. Additionally, we introduce an iterative inference strategy that refines segmentation masks in two steps: an initial pass using name-based prompts, followed by a refinement step using location prompts. Experiments on various datasets and settings demonstrate that RoSIS outperforms existing vision-based and promptable methods under robust conditions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is that, in computer-assisted surgery, existing text-promptable surgical instrument segmentation (SIS) methods generate segmentation masks even for instruments that do not appear in the image. Specifically:

1. **Problems of existing methods**:
   - Existing text-promptable segmentation methods assume that the object described by a given text prompt is present in the image. As a result, the model generates a mask whenever a related prompt is provided, even if the object is absent.
   - Existing methods work around this by supplying prompts only for objects known to be present. This relies on information that is inaccessible in a vision-based method setting (namely, which objects are in the image) and leads to unfair comparisons.
2. **Redefining the problem**:
   - For fair comparison, the authors redefine the task as **Robust text-promptable SIS (R-SIS)**. In R-SIS, prompts for all classes are forwarded, and the model must determine whether the object described by each prompt exists in the image, so that it does not generate masks for absent objects.
3. **The proposed framework**:
   - The authors propose **RoSIS (Robust Surgical Instrument Segmentation)**, a framework that combines visual and language features for promptable segmentation in the R-SIS setting.
   - RoSIS adopts an encoder-decoder architecture with a Multi-Modal Fusion Block (MMFB) and a Selective Gate Block (SGB) to balance the integration of vision and language features.
   - RoSIS also introduces an iterative inference strategy that refines the segmentation mask in two steps: an initial pass using name-based prompts, followed by a refinement step using location prompts.
4. **Objectives**:
   - With these components, RoSIS aims to provide more accurate and robust segmentation in complex surgical environments, reducing false positives for absent instruments.

In summary, the paper targets the failure of existing text-promptable SIS methods on prompts for absent objects, and addresses it with the R-SIS setting and the RoSIS framework.
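The R-SIS protocol and the two-step inference described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the model interface (a callable returning a mask and an existence score), the `threshold` value, the `describe_location` helper, and the wording of the location prompt are all hypothetical, chosen only to show the control flow of forwarding every class prompt and suppressing masks for absent classes.

```python
from typing import Callable, Dict, List, Optional, Tuple

Mask = List[List[int]]  # binary mask as rows of 0/1 values

# Hypothetical interface: (image, prompt) -> (mask, existence score in [0, 1]).
Model = Callable[[object, str], Tuple[Mask, float]]


def describe_location(mask: Mask) -> str:
    """Toy location descriptor (assumption): the half holding most mask pixels."""
    width = len(mask[0])
    left = sum(sum(row[: width // 2]) for row in mask)
    right = sum(sum(row[width // 2:]) for row in mask)
    return "left" if left >= right else "right"


def r_sis_inference(
    image: object,
    class_names: List[str],
    model: Model,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Dict[str, Optional[Mask]]:
    """Forward a prompt for every class; keep a mask only if the class is judged present."""
    results: Dict[str, Optional[Mask]] = {}
    for name in class_names:
        # Step 1: name-based prompt, issued whether or not the class is present.
        mask, score = model(image, name)
        if score < threshold:
            results[name] = None  # judged absent: emit no mask
            continue
        # Step 2: refinement with a location prompt derived from the first mask.
        location = describe_location(mask)
        refined, _ = model(image, f"{name} on the {location} side")
        results[name] = refined
    return results


def toy_model(image: dict, prompt: str) -> Tuple[Mask, float]:
    """Stand-in for a real network: looks up masks stored in the toy image dict."""
    for name, mask in image["present"].items():
        if prompt.startswith(name):
            return mask, 0.9
    return [[0, 0], [0, 0]], 0.1


image = {"present": {"scissors": [[0, 1], [0, 1]]}}
output = r_sis_inference(image, ["scissors", "forceps"], toy_model)
```

Here `output["forceps"]` is `None` because the stub reports a low existence score for that prompt, while `output["scissors"]` holds the mask returned by the second, location-prompted pass. A setting that only prompts for classes known to be present would never exercise the `None` branch, which is the gap R-SIS is meant to close.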