How to Efficiently Adapt Large Segmentation Model(SAM) to Medical Images

Xinrong Hu, Xiaowei Xu, Yiyu Shi
2023-06-24
Abstract: The emerging large-scale segmentation model, Segment Anything (SAM), exhibits impressive zero-shot segmentation capabilities on natural images. However, when applied to medical images, SAM suffers from a noticeable performance drop. To make SAM a real "foundation model" for the computer vision community, it is critical to find an efficient way to customize SAM for medical image datasets. In this work, we propose to freeze the SAM encoder and finetune a lightweight task-specific prediction head, since most of the weights in SAM are contributed by the encoder. In addition, SAM is a promptable model, but prompts are not necessarily available in all application cases, and producing precise prompts for multi-class segmentation is time-consuming. Therefore, we explore three types of prompt-free prediction heads in this work: ViT, CNN, and linear layers. For the ViT head, we remove the prompt tokens in the mask decoder of SAM; the resulting model is named AutoSAM. After this modification, AutoSAM can also generate masks for different classes in a single inference. To evaluate the label efficiency of our finetuning method, we compare the results of these three prediction heads on a public medical image segmentation dataset with limited labeled data. Experiments demonstrate that finetuning SAM significantly improves its performance on the medical image dataset, even with just one labeled volume. Moreover, the AutoSAM and CNN prediction heads also achieve better segmentation accuracy than training from scratch and self-supervised learning approaches when annotations are scarce.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to efficiently adapt the large-scale Segment Anything Model (SAM) to medical images. Although SAM performs well on zero-shot segmentation of natural images, its performance drops noticeably when applied to medical images. The paper therefore proposes customizing SAM for medical image datasets by freezing the weights of the SAM encoder and fine-tuning only a lightweight task-specific prediction head. In addition, because accurate prompts may not be available in medical image segmentation tasks, the paper explores three prompt-free prediction heads, based on a Vision Transformer (ViT), a Convolutional Neural Network (CNN), and linear layers. The goal is to improve segmentation performance when only a small amount of labeled data is available, reducing the dependence on large annotated datasets.
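
The frozen-encoder, trainable-head setup can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: `ToyEncoder` is a hypothetical stand-in for the SAM image encoder (the real encoder would be loaded from a SAM checkpoint), `CNNHead` is an assumed prompt-free head, and the image size, channel widths, class count, and learning rate are arbitrary illustrative choices. The sketch only shows the training pattern: encoder weights receive no updates, while the head is optimized on labeled masks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the SAM image encoder (not the real ViT)."""

    def __init__(self, embed_dim=16):
        super().__init__()
        # 4x4 patch embedding: (B, 1, 32, 32) -> (B, embed_dim, 8, 8)
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=4, stride=4)

    def forward(self, x):
        return self.patch_embed(x)


class CNNHead(nn.Module):
    """Assumed prompt-free head: upsamples the embedding to per-pixel class logits."""

    def __init__(self, embed_dim=16, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 8, kernel_size=2, stride=2),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(8, num_classes, kernel_size=2, stride=2),  # 16x16 -> 32x32
        )

    def forward(self, z):
        return self.net(z)


encoder = ToyEncoder()
head = CNNHead()

# Freeze the encoder: its parameters are excluded from gradient computation.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Random tensors standing in for a small labeled batch (illustrative only).
images = torch.randn(2, 1, 32, 32)
labels = torch.randint(0, 3, (2, 32, 32))

# Snapshots used to check which weights change during training.
encoder_before = [p.clone() for p in encoder.parameters()]
head_before = [p.clone() for p in head.parameters()]

for _ in range(3):
    with torch.no_grad():  # no graph is built for the frozen encoder
        z = encoder(images)
    logits = head(z)  # (B, num_classes, 32, 32): all classes in one forward pass
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the encoder output does not depend on any trainable parameter, the embeddings could also be computed once and cached, which is one reason this style of fine-tuning is cheap compared with updating the full model.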