Abstract: In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM can utilize annotated reference images to comprehend specific objects and segment those objects in the target image. Notably, the VRP encoder supports a variety of annotation formats for reference images, including \textbf{point}, \textbf{box}, \textbf{scribble}, and \textbf{mask}. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to segment unseen objects and enabling cross-domain segmentation. The source code and models will be available at \url{

What problem does this paper attempt to address?

This paper addresses a limitation of the Segment Anything Model (SAM) in image segmentation: SAM depends on specific prompts (such as points, boxes, or rough masks) that the user must provide, which makes it inefficient for complex scenes and large numbers of images and requires a high level of user familiarity. To address this, the authors propose a new model named VRP-SAM. By introducing a Visual Reference Prompt (VRP) encoder, SAM can use annotated reference images as prompts to segment specific objects in the target image. This approach improves the adaptability and user-friendliness of SAM and also enhances the model's generalization ability, so that it performs better on unseen objects and in cross-domain scenarios. Specifically, the main contributions of VRP-SAM include:

1. **Introduction of Visual Reference Prompt**: By using annotated reference images as prompts, VRP-SAM can understand specific objects and segment them in the target image, which reduces the need for users to provide specific prompts for each image and improves efficiency.
2. **Support for Multiple Annotation Formats**: The VRP encoder supports reference images annotated with points, boxes, scribbles, or masks, which increases the flexibility of the model.
3. **Enhanced Generalization Ability**: VRP-SAM uses a meta-learning strategy, and experimental results on different datasets show that it generalizes well to unseen objects and cross-domain scenarios.
4. **Retention of SAM's Advantages**: While extending the functionality of SAM, VRP-SAM retains SAM's original advantages, such as class-agnostic segmentation ability and high precision.

With these improvements, VRP-SAM achieves state-of-the-art performance on the Pascal and COCO datasets in the visual reference segmentation task, with only a small number of learnable parameters.
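To make the multiple annotation formats more concrete, the sketch below shows one way that point, box, scribble, and mask annotations on a reference image could be rasterized into a common binary mask before being handed to an encoder. This is an illustration under stated assumptions, not the VRP encoder's actual input pipeline: the function name `annotation_to_mask`, its parameters, and the choice to draw points and scribble samples as small discs of a fixed `radius` are all hypothetical and do not come from the paper.

```python
def annotation_to_mask(kind, data, height, width, radius=2):
    """Rasterize a reference annotation into a binary mask (list of rows).

    Hypothetical helper for illustration only. Assumed input conventions:
      - "mask":     a height x width grid of values; non-zero means foreground
      - "box":      (x0, y0, x1, y1) with inclusive pixel corners
      - "point":    a list of (x, y) clicks
      - "scribble": a list of (x, y) samples along a stroke
    """
    mask = [[0] * width for _ in range(height)]

    if kind == "mask":
        if len(data) != height or any(len(row) != width for row in data):
            raise ValueError("mask must have shape (height, width)")
        # Already dense: only binarize.
        return [[1 if value else 0 for value in row] for row in data]

    if kind == "box":
        x0, y0, x1, y1 = data
        # Clamp the box to the image and fill it.
        for y in range(max(0, y0), min(height - 1, y1) + 1):
            for x in range(max(0, x0), min(width - 1, x1) + 1):
                mask[y][x] = 1
        return mask

    if kind in ("point", "scribble"):
        # Draw each click or stroke sample as a small disc (an assumption).
        for cx, cy in data:
            for y in range(max(0, cy - radius), min(height - 1, cy + radius) + 1):
                for x in range(max(0, cx - radius), min(width - 1, cx + radius) + 1):
                    if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                        mask[y][x] = 1
        return mask

    raise ValueError(f"unknown annotation kind: {kind!r}")
```

For example, `annotation_to_mask("box", (1, 1, 2, 2), 4, 4)` marks a 2 x 2 block of foreground pixels. Mapping every format to the same dense representation is only one possible design; the actual model may treat sparse and dense annotations differently.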

VRP-SAM: SAM with Visual Reference Prompt

RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

RAP-SAM: Towards Real-Time All-Purpose Segment Anything

SAM-RSIS: Progressively Adapting SAM With Box Prompting to Remote Sensing Image Instance Segmentation

AI-SAM: Automatic and Interactive Segment Anything Model

A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering

AM-SAM: Automated Prompting and Mask Calibration for Segment Anything Model

SAM-SP: Self-Prompting Makes SAM Great Again

PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation

SAMP: Adapting Segment Anything Model for Pose Estimation

RSAM-Seg: A SAM-based Approach with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation

RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation Based on Visual Foundation Model

RefSAM3D: Adapting SAM with Cross-modal Reference for 3D Medical Image Segmentation

Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts

Performance Evaluation of Segment Anything Model with Variational Prompting for Application to Non-Visible Spectrum Imagery

SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising

EmbodiedSAM: Online Segment Any 3D Thing in Real Time

3DSAM: Segment Anything in NeRF

Semantic-Enhanced Point-Box Joint Prompting for Video Object Segmentation

Semantic-SAM: Segment and Recognize Anything at Any Granularity