Abstract: Segment Anything Model (SAM) has gained significant recognition in the field of semantic segmentation due to its versatile capabilities and impressive performance. Despite its success, SAM faces two primary limitations: (1) it relies heavily on meticulous human-provided prompts such as key points, bounding boxes, or text, which are labor-intensive to supply; (2) the mask decoder's feature representation is sometimes inaccurate, as it solely employs dot product operations at the end of the mask decoder, which inadequately capture the correlations needed for precise segmentation. Current solutions to these problems, such as fine-tuning SAM, often require retraining a large number of parameters, which demands a large amount of time and computing resources. To address these limitations, we propose an automated prompting and mask calibration method called AM-SAM, based on a bi-level optimization framework. Our approach automatically generates prompts for an input image, eliminating the need for human involvement while performing well in early training epochs and converging faster. Additionally, we freeze the main part of SAM and modify the mask decoder with Low-Rank Adaptation (LoRA), enhancing the mask decoder's feature representation by incorporating techniques that go beyond simple dot product operations to more accurately capture and utilize feature correlations. Our experimental results demonstrate that AM-SAM achieves highly accurate segmentation, matching or exceeding the effectiveness of human-generated and default prompts. Notably, on the body segmentation dataset, our method yields a 5% higher Dice score with a 4-example few-shot training set compared to the SOTA method, underscoring its superiority in semantic segmentation tasks.
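The abstract's idea of freezing most of the model and adapting a layer with LoRA can be illustrated with a minimal NumPy sketch. This is not AM-SAM's code: the class name `LoRALinear`, the rank, the `alpha` scaling, and the 256-dimensional layer are assumptions chosen for illustration. It only shows the generic LoRA mechanism, in which a frozen weight `W` is augmented by a trainable low-rank product `B @ A`.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (generic LoRA).

    Hypothetical illustration; not taken from the AM-SAM implementation.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pretrained weight, kept frozen
        # Trainable down-projection, small random initialization.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        # Trainable up-projection, zero at init so the layer starts unchanged.
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, in_dim); output = frozen path + scaled low-rank path.
        frozen = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return frozen + self.scale * update

    def trainable_params(self):
        # Only A and B would receive gradient updates.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(256, 256))   # stands in for one pretrained projection
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 256))
y = layer(x)

print("frozen parameters:   ", W.size)
print("trainable parameters:", layer.trainable_params())
```

With these assumed sizes, the low-rank factors hold far fewer values than the frozen weight, which is the reason LoRA avoids retraining a large number of parameters.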

What problem does this paper attempt to address?

This paper attempts to solve two main problems of the Segment Anything Model (SAM) in semantic segmentation tasks:

1. **Dependence on elaborate human prompts**: SAM requires key points, bounding boxes, or text information provided by humans, which is time-consuming and labor-intensive in practical applications. For example, in real-time road scenes, users need to provide multiple bounding boxes to improve the segmentation result. Moreover, in specific fields such as medical image segmentation, the data distribution differs so much that SAM cannot segment these images well even when appropriate prompts are provided.
2. **Inaccurate feature representation of the mask decoder**: The mask decoder of SAM uses only dot-product operations to generate masks, which cannot fully capture complex feature correlations. This results in inaccurate feature representation and lowers segmentation accuracy.

To solve these problems, the authors propose AM-SAM (Automated Prompting and Mask Calibration for SAM), a method based on a bi-level optimization framework. Specific improvements include:

- **Automatic prompt generation**: An object detector (such as YOLOv8) automatically generates bounding boxes as initial prompts. This eliminates the dependence on human prompts and significantly improves convergence speed and accuracy in the early training stage.
- **Mask calibration**: The mask decoder of SAM is modified to enhance feature fusion with element-wise multiplication (Hadamard product), so that feature correlations are captured and used more accurately and segmentation accuracy improves.

The experimental results show that AM-SAM performs well on multiple datasets, especially in few-shot learning scenarios, where it outperforms existing methods. For example, on the human body segmentation dataset, AM-SAM achieves a Dice score of 81.3% with only 4 training samples, which is 5 percentage points higher than the previous state-of-the-art method BLO-SAM.

### Summary

AM-SAM addresses SAM's dependence on human prompts and its inaccurate feature representation through automated prompt generation and mask calibration, significantly improving the performance of semantic segmentation tasks, especially in few-shot learning and medical image segmentation.
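The contrast between dot-product mask generation and Hadamard-product fusion, along with the Dice score used to report results, can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' implementation: the function names, the tensor shapes, and the per-channel `channel_weights` are assumptions. The sketch shows one general fact: summing a Hadamard product over channels reproduces the dot product, so a decoder that keeps the element-wise product can weight or transform individual channels before reducing them.

```python
import numpy as np


def dice_score(pred, target, eps=1e-7):
    """Dice = 2 * |P and T| / (|P| + |T|) for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


def dot_product_mask(token, features):
    # token: (C,), features: (C, H, W) -> mask logits of shape (H, W).
    return np.einsum("c,chw->hw", token, features)


def hadamard_mask(token, features, channel_weights):
    # Element-wise product keeps a separate interaction per channel: (C, H, W).
    fused = token[:, None, None] * features
    # Hypothetical learned per-channel weights reduce the fused tensor to (H, W).
    return np.einsum("c,chw->hw", channel_weights, fused)


rng = np.random.default_rng(0)
token = rng.normal(size=8)              # assumed mask-token embedding
features = rng.normal(size=(8, 4, 4))   # assumed image feature map

plain = dot_product_mask(token, features)
uniform = hadamard_mask(token, features, np.ones(8))
weighted = hadamard_mask(token, features, rng.uniform(0.5, 1.5, size=8))

# With all-ones weights, the Hadamard path collapses to the dot product.
print("matches dot product with uniform weights:", np.allclose(plain, uniform))

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
print("Dice:", round(float(dice_score(pred, target)), 4))
```

In the Dice example, the predicted and target masks share one pixel out of three foreground pixels in total, so the score is 2/3.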

AM-SAM: Automated Prompting and Mask Calibration for Segment Anything Model

PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation

MaskSAM: Towards Auto-prompt SAM with Mask Classification for Medical Image Segmentation

SAM Fails to Segment Anything? – SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More

SAMP: Adapting Segment Anything Model for Pose Estimation

MeSAM: Multiscale Enhanced Segment Anything Model for Optical Remote Sensing Images

Stable Segment Anything Model

SAM-MPA: Applying SAM to Few-shot Medical Image Segmentation using Mask Propagation and Auto-prompting

Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding

SAM-Adapter: Adapting Segment Anything in Underperformed Scenes

SAM-SP: Self-Prompting Makes SAM Great Again

MapSAM: Adapting Segment Anything Model for Automated Feature Detection in Historical Maps

Segment Anything in High Quality

AGSAM: Agent-Guided Segment Anything Model for Automatic Segmentation in Few-Shot Scenarios

SAM-RSIS: Progressively Adapting SAM With Box Prompting to Remote Sensing Image Instance Segmentation

CoSAM: Self-Correcting SAM for Domain Generalization in 2D Medical Image Segmentation

AI-SAM: Automatic and Interactive Segment Anything Model

PointSAM: Pointly-Supervised Segment Anything Model for Remote Sensing Images

All-in-SAM: from Weak Annotation to Pixel-wise Nuclei Segmentation with Prompt-based Finetuning

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

SAM Fewshot Finetuning for Anatomical Segmentation in Medical Images