VRP-SAM: SAM with Visual Reference Prompt

Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li
2024-03-30
Abstract: In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM can utilize annotated reference images to comprehend specific objects and segment those objects in a target image. Notably, the VRP encoder supports a variety of annotation formats for reference images, including point, box, scribble, and mask. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to segment unseen objects and to perform cross-domain segmentation. The source code and models will be available at \url{
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a limitation of the existing Segment Anything Model (SAM) in image segmentation: because SAM depends on specific prompts provided by the user for each image (such as points, boxes, or rough masks), it is inefficient when dealing with complex scenes or large numbers of images, and it demands considerable familiarity from the user. To this end, the authors propose a new model named VRP-SAM. By introducing a Visual Reference Prompt (VRP) encoder, SAM can use annotated reference images as prompts to segment specific objects in the target image. This approach improves the adaptability and user-friendliness of SAM and also enhances the generalization ability of the model, enabling it to perform better on unseen objects and in cross-domain scenarios. Specifically, the main contributions of VRP-SAM include:
1. **Introduction of Visual Reference Prompt**: By using annotated reference images as prompts, VRP-SAM can understand specific objects and segment them in the target image, which reduces the need for users to provide specific prompts for each image and improves efficiency.
2. **Support for Multiple Annotation Formats**: The VRP encoder supports reference images annotated with points, boxes, scribbles, or masks, which increases the flexibility of the model.
3. **Enhanced Generalization Ability**: The VRP encoder is trained with a meta-learning strategy, and experiments on different datasets show that VRP-SAM generalizes well to unseen objects and cross-domain scenarios.
4. **Retention of SAM's Advantages**: While extending the functionality of SAM, VRP-SAM retains SAM's original advantages, such as class-agnostic segmentation and high precision.
Through these improvements, VRP-SAM achieves state-of-the-art performance in the visual reference segmentation task on the Pascal and COCO datasets while using only a small number of learnable parameters.
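To make the "multiple annotation formats" point concrete, here is a minimal sketch of one way the four annotation types could be unified into a single binary mask before being passed to an encoder. This is an illustrative assumption, not the authors' implementation: the function name `annotation_to_mask`, the `(row, col)` coordinate convention, the inclusive box format, and the `radius` parameter are all hypothetical, and the summary above does not describe how the VRP encoder actually consumes annotations.

```python
from typing import List

Mask = List[List[int]]


def _stamp(mask: Mask, row: int, col: int, radius: int) -> None:
    """Set a square of side 2*radius+1 centred on (row, col), clipped to the image."""
    height, width = len(mask), len(mask[0])
    for r in range(max(0, row - radius), min(height, row + radius + 1)):
        for c in range(max(0, col - radius), min(width, col + radius + 1)):
            mask[r][c] = 1


def annotation_to_mask(kind: str, data, height: int, width: int, radius: int = 1) -> Mask:
    """Rasterize a reference annotation into a binary mask (hypothetical helper).

    kind == "point":    data is a list of (row, col) clicks
    kind == "box":      data is (top, left, bottom, right), inclusive
    kind == "scribble": data is a polyline given as a list of (row, col) vertices
    kind == "mask":     data is already a height x width grid of 0/1 values
    """
    mask = [[0] * width for _ in range(height)]
    if kind == "point":
        for row, col in data:
            _stamp(mask, row, col, radius)
    elif kind == "box":
        top, left, bottom, right = data
        for r in range(max(0, top), min(height, bottom + 1)):
            for c in range(max(0, left), min(width, right + 1)):
                mask[r][c] = 1
    elif kind == "scribble":
        for row, col in data:  # stamp every vertex, so a single-vertex scribble still marks pixels
            _stamp(mask, row, col, radius)
        for (r0, c0), (r1, c1) in zip(data, data[1:]):
            # Linear interpolation between consecutive vertices.
            steps = max(abs(r1 - r0), abs(c1 - c0), 1)
            for i in range(steps + 1):
                row = round(r0 + (r1 - r0) * i / steps)
                col = round(c0 + (c1 - c0) * i / steps)
                _stamp(mask, row, col, radius)
    elif kind == "mask":
        if len(data) != height or any(len(row) != width for row in data):
            raise ValueError("mask annotation must match the image size")
        mask = [[1 if value else 0 for value in row] for row in data]
    else:
        raise ValueError(f"unsupported annotation kind: {kind!r}")
    return mask


def area(mask: Mask) -> int:
    """Number of foreground pixels."""
    return sum(sum(row) for row in mask)
```

For example, `annotation_to_mask("box", (1, 1, 2, 3), 5, 5)` marks a 2 x 3 block of pixels. Under this assumed design, a downstream encoder would only ever see one input type regardless of how the user annotated the reference image.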