Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding

Zhiheng Cheng, Qingyue Wei, Hongru Zhu, Yan Wang, Liangqiong Qu, Wei Shao, Yuyin Zhou
2024-03-27
Abstract: The Segment Anything Model (SAM) has garnered significant attention for its versatile segmentation abilities and intuitive prompt-based interface. However, its application in medical imaging presents challenges, requiring either substantial training costs and extensive medical datasets for full model fine-tuning or high-quality prompts for optimal performance. This paper introduces H-SAM: a prompt-free adaptation of SAM tailored for efficient fine-tuning of medical images via a two-stage hierarchical decoding procedure. In the initial stage, H-SAM employs SAM's original decoder to generate a prior probabilistic mask, guiding a more intricate decoding process in the second stage. Specifically, we propose two key designs: 1) A class-balanced, mask-guided self-attention mechanism addressing the unbalanced label distribution, enhancing image embedding; 2) A learnable mask cross-attention mechanism spatially modulating the interplay among different image regions based on the prior mask. Moreover, the inclusion of a hierarchical pixel decoder in H-SAM enhances its proficiency in capturing fine-grained and localized details. This approach enables SAM to effectively integrate learned medical priors, facilitating enhanced adaptation for medical image segmentation with limited samples. Our H-SAM demonstrates a 4.78% improvement in average Dice compared to existing prompt-free SAM variants for multi-organ segmentation using only 10% of 2D slices. Notably, without using any unlabeled data, H-SAM even outperforms state-of-the-art semi-supervised models relying on extensive unlabeled training data across various medical datasets. Our code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to fine-tune a large foundation model, the Segment Anything Model (SAM), efficiently for medical image segmentation while reducing the dependence on large amounts of labeled data. Although SAM has strong zero-shot segmentation capabilities on natural images, it performs poorly on medical images, mainly because it was not exposed to medical images during training. Existing adaptations have their own limitations: they require either a large amount of labeled data for full-model fine-tuning or high-quality prompts to reach good performance. The paper therefore proposes H-SAM, a prompt-free SAM variant that is fine-tuned on limited medical data and improves multi-organ segmentation accuracy through a two-stage hierarchical decoding process.

The main innovations of H-SAM are:

1. **Hierarchical Decoding**: H-SAM adopts a two-stage hierarchical decoding strategy. In the first stage, SAM's original decoder generates a prior probability mask; in the second stage, a more refined decoding is carried out, guided by that mask.
2. **Class-Balanced Mask-Guided Self-Attention Mechanism**: To address class imbalance in the label distribution, H-SAM introduces a class-balanced, mask-guided self-attention mechanism, which enhances the image embedding by increasing the variation of tail classes.
3. **Learnable Mask Cross-Attention Mechanism**: A learnable mask cross-attention mechanism spatially modulates the interplay among different image regions based on the prior mask, thereby improving segmentation.
4. **Hierarchical Pixel Decoder**: To capture fine-grained, localized details, H-SAM adds a hierarchical pixel decoder, which further improves segmentation accuracy by using U-Net-style skip connections.

The experimental results show that H-SAM performs well on multi-organ segmentation, especially in the few-shot setting. Using only 10% of 2D slices, it reaches an average Dice coefficient of 80.35%, a 4.78% improvement over existing prompt-free SAM variants. H-SAM also performs strongly on prostate and left atrium segmentation. Even without using any unlabeled data, it outperforms semi-supervised models that rely on a large amount of unlabeled data.
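
The results above are reported as an average Dice coefficient. As a reminder of what that metric measures, here is a minimal sketch of per-class Dice averaged over foreground classes. The function names (`dice_per_class`, `mean_dice`) and the choice to skip classes that are absent from both the prediction and the ground truth are assumptions made for illustration; the paper's exact evaluation protocol is not described in this summary.

```python
import numpy as np

def dice_per_class(pred, target, num_classes):
    """Dice score for each foreground class (labels 1..num_classes-1).

    pred, target: integer label maps of identical shape; label 0 is background.
    Classes absent from both maps are skipped (an illustrative convention).
    """
    scores = {}
    for c in range(1, num_classes):
        p = pred == c
        t = target == c
        denom = p.sum() + t.sum()
        if denom == 0:
            continue  # class appears in neither map
        # Dice = 2 * |P ∩ T| / (|P| + |T|)
        scores[c] = float(2.0 * np.logical_and(p, t).sum() / denom)
    return scores

def mean_dice(pred, target, num_classes):
    """Average of the per-class Dice scores, as a fraction in [0, 1]."""
    scores = dice_per_class(pred, target, num_classes)
    return float(np.mean(list(scores.values()))) if scores else float("nan")
```

On this scale, a reported average Dice of 80.35% corresponds to a `mean_dice` value of 0.8035.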
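
To make the idea of a prior mask guiding the second decoding stage more concrete, the sketch below shows one generic way a probability mask can spatially modulate cross-attention: the log of the prior is added to the attention logits, which is equivalent to multiplying the unbiased attention weights by the prior and renormalizing. This is not the authors' implementation. The summary does not specify how the mask enters the attention, so the function name `mask_biased_cross_attention`, the `bias_scale` parameter (a stand-in for a learnable scalar), and the tensor layout are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mask_biased_cross_attention(queries, keys, values, prior_mask, bias_scale=1.0):
    """Cross-attention whose weights are biased toward a stage-one prior mask.

    queries:    (C, d) one query vector per class (hypothetical layout)
    keys:       (N, d) one key per image location
    values:     (N, d) one value per image location
    prior_mask: (C, N) stage-one probabilities in [0, 1]
    bias_scale: hypothetical stand-in for a learnable scalar
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)                      # (C, N)
    logits = logits + bias_scale * np.log(prior_mask + 1e-6)    # favor likely regions
    weights = softmax(logits, axis=-1)                          # rows sum to 1
    return weights @ values, weights
```

With `bias_scale=0` this reduces to ordinary scaled dot-product cross-attention; larger values make each class query attend more strongly to the locations the first-stage mask already considers likely.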