Rethinking RGB-D Fusion for Semantic Segmentation in Surgical Datasets

Muhammad Abdullah Jamal, Omid Mohareri
2024-07-29
Abstract: Surgical scene understanding is a key technical component for enabling intelligent and context-aware systems that can transform various aspects of surgical interventions. In this work, we focus on the semantic segmentation task, propose a simple yet effective multi-modal (RGB and depth) training framework called SurgDepth, and show state-of-the-art (SOTA) results on all publicly available datasets applicable for this task. Unlike previous approaches, which either fine-tune SOTA segmentation models trained on natural images, or encode RGB or RGB-D information using RGB-only pre-trained backbones, SurgDepth, which is built on top of Vision Transformers (ViTs), is designed to encode both RGB and depth information through a simple fusion mechanism. We conduct extensive experiments on benchmark datasets including EndoVis2022, AutoLapro, LapI2I and EndoVis2017 to verify the efficacy of SurgDepth. Specifically, SurgDepth achieves a new SOTA IoU of 0.86 on the EndoVis 2022 SAR-RARP50 challenge and outperforms the current best method by at least 4%, using a shallow and compute-efficient decoder consisting of ConvNeXt blocks.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how to improve semantic segmentation on surgical datasets by effectively fusing RGB (color) images with depth information. Specifically, the authors propose a new multi-modal (RGB and depth) training framework, SurgDepth, aimed at improving the accuracy and generalization of surgical scene understanding, and with it the potential for clinical application.

### Problem Background

Surgical scene understanding is a key technical component for enabling intelligent, context-aware systems that can transform various aspects of surgical interventions. Semantic segmentation classifies tools, anatomical structures, and other objects in the surgical scene at the pixel level. Surgical data poses particular challenges, such as occlusion, illumination changes, smoke and blood, and diverse instrument and tissue types, which limit the accuracy and generality of existing methods.

### Solution

To address these problems, the authors propose SurgDepth, a framework built on Vision Transformers (ViTs) that combines RGB and depth information through a simple and effective fusion mechanism. The main contributions of SurgDepth are:

1. **New RGB-D training framework**: SurgDepth, a multi-modal framework for semantic segmentation in surgical scenes.
2. **3D-aware fusion block**: This module improves the localization of objects and structures by fusing the 3D geometric information carried by the depth map.
3. **Lightweight decoder**: A shallow decoder built from ConvNeXt blocks generates the segmentation maps.

### Experimental Results

The authors conducted extensive experiments on multiple benchmark datasets to verify the effectiveness of SurgDepth. In particular, on the EndoVis2022 SAR-RARP50 challenge, SurgDepth achieved an intersection-over-union (IoU) of 0.86, outperforming the current best method by at least 4%. SurgDepth also achieved new state-of-the-art (SOTA) performance on other datasets such as AutoLapro, LapI2I, and CholecSeg8k.

### Summary

By introducing 3D geometric information, SurgDepth significantly improves the performance of semantic segmentation in surgical scenes while keeping computational cost low. This provides strong support for future surgical data analysis and automation.

### Related Formulas

Some of the key formulas involved in the paper are as follows:

- **Attention mechanism of the 3D-aware fusion block**:

\[ Q = \mathrm{FC}\left(\mathrm{AdaptivePool}_{k \times k}\left(\mathrm{Concat}\left(X_{\text{rgb}}^i, X_{\text{depth}}^i\right)\right)\right) \]

\[ K = \mathrm{FC}\left(X_{\text{rgb}}^i\right), \quad V = \mathrm{FC}\left(X_{\text{rgb}}^i\right) \]

\[ X_{\text{fusion}} = \mathrm{Bilinear}\left(V \cdot \mathrm{Softmax}\left(\frac{Q^\top K}{\sqrt{C_d}}\right)\right) \]

where \( Q \) is the query feature, \( K \) and \( V \) are the key and value features respectively, and \( C_d \) is the dimension of \( Q \), \( K \) and \( V \).
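
The fusion-block attention formulas can be traced step by step in a small NumPy sketch. This is an illustration of the formulas as summarized here, not the authors' implementation. Several details are assumptions made only for the example: the `FC` layers are plain bias-free matrix multiplications with random weights, the pooling is average pooling on a feature map whose height and width are divisible by `k`, the bilinear step aligns corners, and tokens are stored as rows, so \( Q^\top K \) appears as `q @ key.T`. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_avg_pool(x, k):
    # Average-pool an (H, W, C) map to (k, k, C).
    # Assumption: H and W are divisible by k.
    h, w, c = x.shape
    return x.reshape(k, h // k, k, w // k, c).mean(axis=(1, 3))

def bilinear_resize(g, h, w):
    # Bilinearly resize a (k1, k2, C) grid to (h, w, C), corners aligned.
    k1, k2, _ = g.shape
    ys, xs = np.linspace(0, k1 - 1, h), np.linspace(0, k2 - 1, w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, k1 - 1), np.minimum(x0 + 1, k2 - 1)
    wy, wx = (ys - y0)[:, None, None], (xs - x0)[None, :, None]
    top = (1 - wx) * g[y0][:, x0] + wx * g[y0][:, x1]
    bottom = (1 - wx) * g[y1][:, x0] + wx * g[y1][:, x1]
    return (1 - wy) * top + wy * bottom

def fusion_attention(x_rgb, x_depth, w_q, w_k, w_v, k=2):
    h, w, _ = x_rgb.shape
    c_d = w_q.shape[1]
    # Q = FC(AdaptivePool_{k x k}(Concat(X_rgb, X_depth)))
    pooled = adaptive_avg_pool(np.concatenate([x_rgb, x_depth], axis=-1), k)
    q = pooled.reshape(k * k, -1) @ w_q                 # (k*k, C_d)
    # K = FC(X_rgb), V = FC(X_rgb)
    tokens = x_rgb.reshape(h * w, -1)
    key, val = tokens @ w_k, tokens @ w_v               # (H*W, C_d) each
    # Softmax(Q^T K / sqrt(C_d)), written for row-major tokens.
    attn = softmax(q @ key.T / np.sqrt(c_d), axis=-1)   # (k*k, H*W)
    out = (attn @ val).reshape(k, k, c_d)
    # X_fusion = Bilinear(...): upsample back to the input resolution.
    return bilinear_resize(out, h, w)

rng = np.random.default_rng(0)
H, W, C, C_D = 8, 8, 4, 6
x_rgb = rng.normal(size=(H, W, C))
x_depth = rng.normal(size=(H, W, C))
w_q = rng.normal(size=(2 * C, C_D))   # query sees RGB and depth channels
w_k = rng.normal(size=(C, C_D))
w_v = rng.normal(size=(C, C_D))
x_fusion = fusion_attention(x_rgb, x_depth, w_q, w_k, w_v, k=2)
print(x_fusion.shape)
```

Because the queries come from a pooled \( k \times k \) grid, the attention matrix has only \( k^2 \) rows instead of one per pixel, which is what keeps this step cheap. Depth enters only through the query, while keys and values are computed from the RGB features alone.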
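
For readers unfamiliar with the intersection-over-union (IoU) metric quoted in the results, the generic per-class definition is sketched below on a toy pair of label maps. How the SAR-RARP50 challenge aggregates IoU across classes and frames is not stated in this summary, so the helper `iou` and the toy data are illustrative assumptions only.

```python
def iou(pred, target, cls):
    """IoU of one class label between two equally sized 2D label maps."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            in_p, in_t = p == cls, t == cls
            inter += in_p and in_t   # pixel is labelled cls in both maps
            union += in_p or in_t    # pixel is labelled cls in at least one map
    return inter / union if union else float("nan")

# Toy 2x3 label maps: 0 = background, 1 = instrument (hypothetical labels).
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
print(iou(pred, target, cls=1))  # 2 shared pixels / 4 pixels in the union = 0.5
```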