MaskSAM: Towards Auto-prompt SAM with Mask Classification for Medical Image Segmentation

Bin Xie, Hao Tang, Bin Duan, Dawen Cai, Yan Yan
2024-03-21
Abstract: Segment Anything Model (SAM), a prompt-driven foundation model for natural image segmentation, has demonstrated impressive zero-shot performance. However, SAM does not work when directly applied to medical image segmentation tasks, since it lacks the functionality to predict semantic labels for predicted masks and requires extra prompts, such as points or boxes, to segment target regions. Meanwhile, there is a huge gap between 2D natural images and 3D medical images, so the performance of SAM is imperfect for medical image segmentation tasks. To address these issues, we propose MaskSAM, a novel mask classification prompt-free SAM adaptation framework for medical image segmentation. We design a prompt generator combined with the image encoder in SAM to generate a set of auxiliary classifier tokens, auxiliary binary masks, and auxiliary bounding boxes. Each pair of auxiliary mask and box prompts, which removes the need for extra prompts, is associated with class label predictions by the sum of the auxiliary classifier token and the learnable global classifier tokens in the mask decoder of SAM, which provides the semantic label predictions. Meanwhile, we design a 3D depth-convolution adapter for image embeddings and a 3D depth-MLP adapter for prompt embeddings. We inject one of them into each transformer block in the image encoder and mask decoder to enable pre-trained 2D SAM models to extract 3D information and adapt to 3D medical images. Our method achieves state-of-the-art performance on AMOS2022, 90.52% Dice, an improvement of 2.7% over nnUNet. Our method surpasses nnUNet by 1.7% on ACDC and 1.0% on Synapse datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges of extending the Segment Anything Model (SAM) from natural image segmentation to medical image segmentation. When SAM is applied directly to medical image segmentation, three main problems arise:

1. **Lack of semantic label prediction**: The binary masks generated by SAM carry no semantic labels, while medical image segmentation tasks usually involve multiple objects with different semantic labels.
2. **Requirement for additional prompts**: SAM requires users to provide precise prompts (such as points or boxes) to segment the target area, which can be difficult in practice, especially for users without medical knowledge.
3. **Insufficient support for 3D medical images**: SAM is designed mainly for 2D natural images, while many medical scans are 3D volumes (such as MRI and CT), so it performs poorly on 3D medical images.

To overcome these problems, the paper proposes MaskSAM, a prompt-free SAM adaptation framework for medical image segmentation. Its main contributions are:

1. **A prompt-free architecture**: A prompt generator automatically produces auxiliary binary masks and bounding boxes that serve as prompts, eliminating the need for manual prompting.
2. **Auxiliary classifier tokens**: The prompt generator also produces auxiliary classifier tokens, which are summed with learnable global classifier tokens so that the model can predict a semantic label for each binary mask.
3. **3D depth-convolution and 3D depth-MLP adapters**: A 3D depth-convolution adapter (for image embeddings) and a 3D depth-MLP adapter (for prompt embeddings) are injected into the transformer blocks of the image encoder and mask decoder, enabling the pre-trained 2D SAM model to extract 3D information and adapt to 3D medical image segmentation.
4. **Extensive experimental verification**: Experiments on three challenging datasets (AMOS2022, ACDC, and Synapse) show that MaskSAM achieves state-of-the-art Dice scores, improving on nnUNet by 2.7%, 1.7%, and 1.0% respectively.

Through these improvements, MaskSAM not only retains the zero-shot ability of SAM, but also successfully adapts it to medical image segmentation tasks, significantly improving performance.
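
The results above are reported as Dice scores. As a rough illustration of what that metric measures, here is a minimal sketch of a per-class Dice coefficient over flattened label maps, using only the Python standard library. The function names, the toy labels, and the convention of scoring an absent class as 1.0 are assumptions made for illustration; they are not taken from the paper or its evaluation code.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int], label: int) -> float:
    """Dice coefficient for one class label over two flattened label maps."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")
    pred_hits = sum(1 for p in pred if p == label)
    target_hits = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if pred_hits + target_hits == 0:
        # Assumption: a class absent from both maps counts as a perfect match.
        return 1.0
    # Dice = 2 * |P ∩ T| / (|P| + |T|)
    return 2.0 * overlap / (pred_hits + target_hits)


def mean_dice(pred: Sequence[int], target: Sequence[int], labels: Sequence[int]) -> float:
    """Unweighted mean of the per-class Dice scores over the given labels."""
    return sum(dice_score(pred, target, c) for c in labels) / len(labels)


# Toy flattened volume: 0 is background, 1 and 2 are hypothetical organ labels.
pred = [0, 1, 1, 2, 2, 2, 0, 1]
target = [0, 1, 1, 1, 2, 2, 0, 0]

for c in (1, 2):
    print(f"class {c}: Dice = {dice_score(pred, target, c):.4f}")
print(f"mean Dice = {mean_dice(pred, target, [1, 2]):.4f}")
```

In this toy case class 1 overlaps on 2 voxels out of 3 predicted and 3 true voxels (Dice 4/6), and class 2 overlaps on 2 voxels out of 3 predicted and 2 true voxels (Dice 4/5). Benchmark protocols may differ in how they average over classes and cases, so this sketch should not be read as the exact procedure behind the reported numbers.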