Abstract: Foundation models such as the recently introduced Segment Anything Model (SAM) have achieved remarkable results in image segmentation tasks. However, these models typically require user interaction through handcrafted prompts such as bounding boxes, which limits their deployment to downstream tasks. Adapting these models to a specific task with fully labeled data also demands expensive prior user interaction to obtain ground-truth annotations. This work proposes to replace conditioning on input prompts with a lightweight module that directly learns a prompt embedding from the image embedding, both of which are subsequently used by the foundation model to output a segmentation mask. Our foundation models with learnable prompts can automatically segment any specific region by 1) modifying the input through a prompt embedding predicted by a simple module, and 2) using weak labels (tight bounding boxes) and few-shot supervision (10 samples). Our approach is validated on MedSAM, a version of SAM fine-tuned for medical images, with results on three medical datasets in MR and ultrasound imaging. Our code is available at https://github.com/Minimel/MedSAMWeakFewShotPromptAutomation.

What problem does this paper attempt to address?

This paper aims to reduce the dependence of medical image segmentation on large amounts of fully labeled data and on user interaction. Specifically, it proposes a method to automate MedSAM (a version of the Segment Anything Model fine-tuned for medical images): prompt embeddings are learned directly from a small number of weakly labeled samples (tight bounding boxes), so that a specific region can be segmented automatically. The method targets good performance in the few-shot setting while reducing the cost and complexity of developing specialized segmentation models.

### Core contributions of the paper:

1. **Automated prompt module**: A lightweight prompt module generates prompt embeddings from the embedding of the input image, replacing the prompts that users previously had to provide manually.
2. **Weakly supervised and few-shot learning**: The module can be trained with only a small number of weakly labeled samples (tight bounding boxes), greatly reducing the need for fully labeled data.
3. **No need to fine-tune MedSAM**: The module is added directly to MedSAM without fine-tuning MedSAM itself, which preserves the generality of the base model.

### Method overview:

- **Design of the prompt module**: The prompt module has two main parts: a convolutional layer that generates dense embeddings and a fully connected layer that generates sparse embeddings. Both embeddings are combined with MedSAM's image embedding to produce the final segmentation mask.
- **Design of the loss function**: To exploit the weak labels given by tight bounding boxes, the paper designs three loss terms:
  - **Empty region loss** ($L_{\text{empty}}$): Ensures that the area outside the bounding box contains only background.
  - **Tight-box constraint loss** ($L_{\text{tightbox}}$): Ensures that every horizontal and vertical line segment crossing the box contains at least one foreground pixel.
  - **Foreground size constraint loss** ($L_{\text{size}}$): Ensures that the size of the predicted foreground area stays within a given range.

### Experimental results:

- **Datasets**: The method is validated on three publicly available medical image datasets: HC18, CAMUS, and ACDC.
- **Performance comparison**: The experiments show that, even with only 10 samples, the proposed method outperforms UNet and TransUNet models trained with fully labeled data on multiple tasks. The drop in performance is reported to be smaller in particular on the right ventricle (RV) segmentation task.

### Conclusion:

The proposed method effectively automates MedSAM, enabling high-quality medical image segmentation with only a small number of weakly labeled samples. This reduces the cost of data labeling and improves the robustness of the model in the few-shot setting.
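
The three box-based loss terms described in the method overview can be illustrated with a small NumPy sketch. This is not the paper's formulation, which the summary does not spell out: the function name `box_losses`, the `(top, left, bottom, right)` box convention, the `size_bounds` fractions, and the simple hinge-style penalties are all assumptions chosen for readability.

```python
import numpy as np


def box_losses(prob, box, size_bounds=(0.3, 0.9)):
    """Hinge-style stand-ins for L_empty, L_tightbox and L_size (assumed forms).

    prob: (H, W) array of foreground probabilities in [0, 1].
    box: (top, left, bottom, right), bottom/right exclusive, non-degenerate.
    size_bounds: assumed min/max foreground area as fractions of the box area.
    """
    top, left, bottom, right = box
    inside = np.zeros(prob.shape, dtype=bool)
    inside[top:bottom, left:right] = True

    # Empty-region term: foreground probability outside the box is penalized.
    outside = prob[~inside]
    l_empty = float(outside.mean()) if outside.size else 0.0

    # Tight-box term: each row and column segment crossing the box should
    # hold at least one foreground pixel, i.e. summed probability >= 1.
    crop = prob[top:bottom, left:right]
    row_deficit = np.maximum(0.0, 1.0 - crop.sum(axis=1))
    col_deficit = np.maximum(0.0, 1.0 - crop.sum(axis=0))
    l_tightbox = float(np.concatenate([row_deficit, col_deficit]).mean())

    # Size term: total foreground mass should stay within [lo, hi].
    box_area = crop.size
    lo, hi = size_bounds[0] * box_area, size_bounds[1] * box_area
    size = float(prob.sum())
    l_size = max(0.0, lo - size, size - hi) / box_area

    return l_empty, l_tightbox, l_size


if __name__ == "__main__":
    # Toy prediction: a plus-shaped blob that touches all four box sides.
    pred = np.zeros((8, 8))
    pred[3:5, 2:6] = 1.0
    pred[2:6, 3:5] = 1.0
    print(box_losses(pred, box=(2, 2, 6, 6)))
```

In a training setup these terms would be computed on the mask decoder's output with a differentiable framework and summed with weights; only the prompt module would receive gradients, since MedSAM itself is kept frozen according to the summary.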

Automating MedSAM by Learning Prompts with Weak Few-Shot Supervision

Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain

SAM-MPA: Applying SAM to Few-shot Medical Image Segmentation using Mask Propagation and Auto-prompting

Self-Prompting Large Vision Models for Few-Shot Medical Image Segmentation

Learnable Prompting SAM-induced Knowledge Distillation for Semi-supervised Medical Image Segmentation

MaskSAM: Towards Auto-prompt SAM with Mask Classification for Medical Image Segmentation

ESP-MedSAM: Efficient Self-Prompting SAM for Universal Domain-Generalized Medical Image Segmentation

Self-Sampling Meta SAM: Enhancing Few-shot Medical Image Segmentation with Meta-Learning

Guided Prompting in SAM for Weakly Supervised Cell Segmentation in Histopathological Images

SAM Fewshot Finetuning for Anatomical Segmentation in Medical Images

SAM on Medical Images: A Comprehensive Study on Three Prompt Modes

Temporally-Extended Prompts Optimization for SAM in Interactive Medical Image Segmentation

Beyond Adapting SAM: Towards End-to-End Ultrasound Image Segmentation via Auto Prompting

Adaptive Prompt Learning with SAM for Few-shot Scanning Probe Microscope Image Segmentation

SAM-SP: Self-Prompting Makes SAM Great Again

Auto-Generating Weak Labels for Real & Synthetic Data to Improve Label-Scarce Medical Image Segmentation

Multi-Prompt Fine-Tuning of Foundation Models for Enhanced Medical Image Segmentation

Sam2Rad: A Segmentation Model for Medical Images with Learnable Prompts

Repurposing Traditional U-Net Predictions for Sparse SAM Prompting in Medical Image Segmentation

Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding