RAP-SAM: Towards Real-Time All-Purpose Segment Anything

Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang
2024-01-19
Abstract: Advanced by transformer architecture, vision foundation models (VFMs) achieve remarkable progress in performance and generalization ability. Segment Anything Model (SAM) is one remarkable model that can achieve generalized segmentation. However, most VFMs cannot run in real time, which makes it difficult to transfer them into products. On the other hand, current real-time segmentation mainly has one purpose, such as semantic segmentation on the driving scene. We argue that diverse outputs are needed for real applications. Thus, this work explores a new real-time segmentation setting, named all-purpose segmentation in real-time, to transfer VFMs in real-time deployment. It contains three different tasks, including interactive segmentation, panoptic segmentation, and video segmentation. We aim to use one model to achieve the above tasks in real time. We first benchmark several strong baselines. Then, we present Real-Time All Purpose SAM (RAP-SAM). It contains an efficient encoder and an efficient decoupled decoder to perform prompt-driven decoding. Moreover, we explore different training strategies and tuning methods to further boost co-training performance. Our code and model are available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to develop a general-purpose model that performs multi-task segmentation in real time. Specifically, it proposes the "Real-Time All-Purpose Segment Anything Model (RAP-SAM)", with the following goals:

1. **Multi-task Segmentation**: Most current Vision Foundation Models (VFMs) have made significant progress in performance and generalization ability, but they usually cannot run in real time, which limits their use in actual products. In addition, most existing real-time segmentation methods focus on a single task, such as semantic segmentation in autonomous driving scenes. RAP-SAM aims to handle interactive segmentation, panoptic segmentation, and video instance segmentation with one model.
2. **Real-time Performance**: The paper focuses on designing an efficient model that can process inputs in real time under limited computing resources. The model therefore needs low computational complexity and fast inference speed while maintaining high precision.
3. **Generality and Flexibility**: RAP-SAM supports segmentation of both images and videos, as well as interactive segmentation based on user input. This flexibility lets the model serve a wider range of scenarios, such as products that combine editing, tracking, and segmentation functions.

### Specific Problems and Solutions

- **Multi-task Joint Training**: The paper proposes a joint training framework covering three tasks (interactive segmentation, panoptic segmentation, and video instance segmentation). By training jointly on the COCO and YouTube-VIS 2019 datasets, the model can learn knowledge shared among the tasks, thereby improving overall performance.
- **Efficient Model Architecture**: To achieve real-time performance, the paper designs a lightweight feature extractor and a unified decoder. The feature extractor adopts lightweight backbone networks (such as ResNet18, STDC-v1, and SeaFormer) and fuses multi-scale features through a Feature Pyramid Network (FPN) and deformable convolution. The decoder adopts a pooling-based dynamic convolution framework to improve efficiency.
- **Adapter Design**: To balance the requirements of different tasks, the paper introduces two lightweight adapters (a prompt adapter and an object adapter). The prompt adapter mainly enhances the local details needed for interactive segmentation, while the object adapter accounts for the scene and temporal features in panoptic segmentation and video segmentation.

### Main Contributions

- **Proposing Real-time General-purpose Segmentation**: The paper presents RAP-SAM as the first model capable of performing multi-task segmentation in a real-time environment.
- **Benchmark Testing**: The paper benchmarks multiple real-time segmentation methods and verifies the performance advantages of RAP-SAM across multiple tasks.
- **Simple and Effective Baseline Model**: RAP-SAM achieves the best speed-precision trade-off among the compared methods, and its simple architecture is easy to implement and deploy.

In conclusion, by designing an efficient, multi-functional model, this paper addresses the shortcomings of existing segmentation methods in real-time performance and multi-task processing, and provides new solutions for practical applications.
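
The pooling-based decoding mentioned under "Efficient Model Architecture" can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes that "pooling" means masked average pooling of backbone features into one query vector per object, and that the query is then applied as a 1x1 kernel over the feature map to produce new mask logits. The names `mask_pool` and `predict_mask`, the threshold of 0.5, and the use of plain Python lists (a real model would use a tensor library) are all illustrative assumptions.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]
Grid = List[List[float]]              # indexed as [row][col]


def mask_pool(features: FeatureMap, mask: Grid, threshold: float = 0.5) -> List[float]:
    """Average every channel over the pixels whose mask value exceeds threshold."""
    selected = [
        (y, x)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value > threshold
    ]
    if not selected:
        # Empty mask: fall back to a zero query instead of dividing by zero.
        return [0.0] * len(features)
    return [sum(ch[y][x] for y, x in selected) / len(selected) for ch in features]


def predict_mask(features: FeatureMap, query: List[float]) -> Grid:
    """Apply the query as a 1x1 kernel: per-pixel dot product with the features."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum(q * ch[y][x] for q, ch in zip(query, features)) for x in range(width)]
        for y in range(height)
    ]


# Toy input (hypothetical values): 2 channels on a 2x2 grid, mask on the diagonal.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [1.0, 1.0]]]
mask = [[1.0, 0.0], [0.0, 1.0]]

query = mask_pool(features, mask)       # one refinement step: pool the query ...
logits = predict_mask(features, query)  # ... then predict new mask logits
print(query, logits)
```

In this sketch the pooling step costs one pass over the masked pixels per query, which hints at why a pooling-based update can be cheaper than full cross-attention between every query and every pixel. The actual decoder in the paper may differ in its details.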