DB-SAM: Delving into High Quality Universal Medical Image Segmentation

Chao Qin, Jiale Cao, Huazhu Fu, Fahad Shahbaz Khan, Rao Muhammad Anwer
2024-10-05
Abstract: Recently, the Segment Anything Model (SAM) has demonstrated promising segmentation capabilities in a variety of downstream segmentation tasks. However, in the context of universal medical image segmentation, there is a notable performance discrepancy when directly applying SAM, due to the domain gap between natural and 2D/3D medical data. In this work, we propose a dual-branch adapted SAM framework, named DB-SAM, that strives to effectively bridge this domain gap. Our dual-branch adapted SAM contains two branches in parallel: a ViT branch and a convolution branch. The ViT branch incorporates a learnable channel attention block after each frozen attention block, which captures domain-specific local features. The convolution branch, on the other hand, employs a lightweight convolutional block to extract domain-specific shallow features from the input medical image. To perform cross-branch feature fusion, we design a bilateral cross-attention block and a ViT convolution fusion block, which dynamically combine the diverse information of the two branches for the mask decoder. Extensive experiments on a large-scale medical image dataset with various 3D and 2D medical segmentation tasks reveal the merits of our proposed contributions. On 21 3D medical image segmentation tasks, our proposed DB-SAM achieves an absolute gain of 8.8% compared to a recent medical SAM adapter in the literature. The code and model are available at https://github.com/AlfredQin/DB-SAM.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the performance degradation that occurs when the existing Segment Anything Model (SAM) is applied directly to 2D and 3D medical images in universal medical image segmentation tasks. Specifically, because of the large domain gap between natural images and 2D/3D medical images, using SAM directly for medical image segmentation leads to a significant reduction in segmentation quality. The paper therefore proposes a dual-branch adaptation framework, DB-SAM, which aims to bridge this domain gap and improve the performance of SAM on medical image segmentation tasks.

### Main contributions of the paper

1. **Dual-branch framework**: DB-SAM contains two parallel branches, a ViT branch and a convolutional branch. The ViT branch captures domain-specific local features by inserting a learnable channel-attention block after each frozen attention block. The convolutional branch uses lightweight convolutional blocks to extract domain-specific shallow features from the input medical images.
2. **Cross-branch feature fusion**: Bilateral cross-attention blocks and ViT-convolution fusion blocks dynamically combine the diverse information of the two branches for the mask decoder.
3. **Experimental verification**: Extensive experiments were carried out on large-scale medical image datasets, covering a variety of 3D and 2D medical segmentation tasks. On 21 3D medical image segmentation tasks, DB-SAM achieves an absolute gain of 8.8% compared to a recent medical SAM adapter.
### Formula representation

- **Channel-attention block**:

  \[ F_{\text{out}} = F_{\text{vit}} + \text{Conv}_{1\times1}(\text{SE}(\text{DWConv}_{3\times3}(\text{LN}(F_{\text{vit}})))) \]

  where \( F_{\text{vit}} \) represents the input embedding from the ViT attention block, \(\text{LN}\) represents layer normalization, \(\text{DWConv}_{3\times3}\) represents depth-wise convolution, \(\text{SE}\) represents the squeeze-and-excitation block, and \(\text{Conv}_{1\times1}\) represents point-wise convolution.

- **Final fusion output**:

  \[ F_{\text{output}} = F_{o}^d \otimes M + F_{o}^s \otimes (1 - M) \]

  where \( F_{o}^d \) and \( F_{o}^s \) represent the features of the ViT branch and the convolutional branch respectively, \(\otimes\) represents element-wise multiplication, and \( M \) is a selective mask generated by the sigmoid function.

### Conclusion

DB-SAM significantly improves the performance of SAM in medical image segmentation tasks by introducing a dual-branch framework and an effective feature fusion mechanism, especially when dealing with small organs and organs with complex shapes.
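
### Illustrative sketch of the two formulas

The channel-attention and fusion formulas under "Formula representation" can be traced step by step in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation (see the linked repository for that). The `(H, W, C)` feature layout, the weight names (`dw_kernel`, `w_reduce`, `w_expand`, `w_point`), the ReLU inside the squeeze-and-excitation step, the layer normalization without learnable scale and shift, and the `mask_logits` input to the sigmoid are all hypothetical choices made for illustration. The summary only states that \( M \) comes from a sigmoid, not what is fed into it.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_norm(x, eps=1e-6):
    # LN: normalize each spatial position over its channel axis
    # (assumption: no learnable scale/shift, to keep the sketch short).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def depthwise_conv3x3(x, kernel):
    # DWConv 3x3: each channel has its own 3x3 filter; zero padding keeps H, W.
    # x: (H, W, C), kernel: (3, 3, C)
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out


def squeeze_excite(x, w_reduce, w_expand):
    # SE: global average pool -> bottleneck -> per-channel gate in (0, 1).
    squeezed = x.mean(axis=(0, 1))                 # (C,)
    hidden = np.maximum(squeezed @ w_reduce, 0.0)  # (C // r,), ReLU assumed
    gate = sigmoid(hidden @ w_expand)              # (C,)
    return x * gate


def channel_attention_block(f_vit, dw_kernel, w_reduce, w_expand, w_point):
    # F_out = F_vit + Conv1x1(SE(DWConv3x3(LN(F_vit))))
    y = layer_norm(f_vit)
    y = depthwise_conv3x3(y, dw_kernel)
    y = squeeze_excite(y, w_reduce, w_expand)
    y = y @ w_point  # a 1x1 convolution is a matrix product over channels
    return f_vit + y


def gated_fusion(f_d, f_s, mask_logits):
    # F_output = F_d * M + F_s * (1 - M), with M = sigmoid(mask_logits).
    m = sigmoid(mask_logits)
    return f_d * m + f_s * (1.0 - m)
```

Two properties follow directly from the formulas and are easy to check with this sketch. Because of the residual connection, a zero `w_point` makes the channel-attention block an identity mapping, so the adapted branch can start from the frozen ViT features unchanged. In the fusion step, `M` acts as a per-element interpolation weight: values near 1 select the ViT-branch features and values near 0 select the convolutional-branch features.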