Semantic-SAM: Segment and Recognize Anything at Any Granularity

Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, Jianfeng Gao
2023-07-11
Abstract: In this paper, we introduce Semantic-SAM, a universal image segmentation model designed to segment and recognize anything at any desired granularity. Our model offers two key advantages: semantic-awareness and granularity-abundance. To achieve semantic-awareness, we consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts. This allows our model to capture rich semantic information. For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels that correspond to multiple ground-truth masks. Notably, this work represents the first attempt to jointly train a model on SA-1B, generic, and part segmentation datasets. Experimental results and visualizations demonstrate that our model successfully achieves semantic-awareness and granularity-abundance. Furthermore, combining SA-1B training with other segmentation tasks, such as panoptic and part segmentation, leads to performance improvements. We will provide code and a demo for further exploration and evaluation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper sets out to develop a general-purpose image segmentation model that can segment and recognize image content at any desired granularity level. It proposes **Semantic-SAM**, a model built around two key advantages:

1. **Semantic-Awareness**: The model understands the semantic meaning of each segmented region. To achieve this, the authors consolidate multiple datasets of different granularities and train with decoupled object and part classification, which promotes knowledge transfer across these sources of semantic information.
2. **Granularity-Abundance**: The model can generate segmentation masks at multiple granularity levels. To this end, the authors propose a multi-choice learning scheme in which each click generates masks at multiple levels, corresponding to multiple ground-truth masks.

### Specific Problems and Solutions

#### 1. Flexibility of the Model Architecture

Existing image segmentation architectures mostly follow a single-input, single-output pipeline, which limits their ability to predict multi-granularity segmentation masks end to end. Clustering-based post-processing can produce multiple masks for a single object query, but such methods are neither efficient nor effective.

**Solutions**:

- **Multi-choice Learning Design**: Introduce a multi-choice learning design in the decoder architecture. Each click is represented as multiple queries, each carrying a different level embedding. These queries learn from all available ground-truth masks, so that a single click can produce high-quality masks at multiple granularity levels.

#### 2. Availability of Training Data

Scaling up segmentation datasets that are both semantically annotated and multi-granular is expensive.
Existing general-purpose object detection and segmentation datasets (such as MSCOCO and Objects365) provide a large amount of data and rich semantic information but are limited to the object level, while part segmentation datasets (such as Pascal Part, PartImageNet, and PACO) provide finer-grained semantic annotations but contain far less data.

**Solutions**:

- **Data Unification**: Integrate seven datasets covering three granularity levels: general-purpose segmentation (MSCOCO, Objects365, ADE20k), part segmentation (Pascal Part, PACO, PartImageNet), and class-agnostic multi-granularity segmentation (SA-1B). The data format is reorganized to match the training objective. After joint training, the model shows strong performance on multiple datasets.

### Experimental Results

- **General-purpose Segmentation**: Experiments on the COCO validation set show that, compared with training on COCO alone, jointly training on SA-1B and COCO significantly improves instance-level detection and segmentation.
- **Part Segmentation**: Experiments on the Pascal Part dataset show that adding SA-1B data improves performance.
- **Single-granularity Interactive Segmentation**: In the 1-click mIoU evaluation on the COCO validation set, Semantic-SAM outperforms SAM.
- **Multi-granularity Interactive Segmentation**: Experiments on a subset of SA-1B show that Semantic-SAM performs better at multi-granularity prediction, especially in average IoU.

### Conclusion

Semantic-SAM is the first attempt to jointly train a model on SA-1B and other classic segmentation datasets. The experimental results show that training with SA-1B data improves performance on other tasks. Comprehensive experiments and visualizations verify the model's semantic awareness and its multi-granularity capability.
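
The multi-choice learning idea described under "Flexibility of the Model Architecture" can be illustrated with a small matching sketch. It assumes that one click yields several predicted masks (one per query) and that each ground-truth mask containing the click should supervise a distinct query. The IoU-based cost, the brute-force search, and all function names here are illustrative assumptions, not the authors' implementation; the summary does not specify the actual matcher or loss.

```python
from itertools import permutations


def rect(r0, r1, c0, c1):
    """Toy binary mask: the set of (row, col) pixels in a rectangle."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def mask_iou(a, b):
    """Intersection over union of two masks given as pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_queries_to_targets(pred_masks, gt_masks):
    """Give every ground-truth mask its own query, maximizing total IoU.

    Brute force over assignments (an assumption for clarity); this is only
    practical for the handful of queries attached to a single click.
    Returns a list of (query_index, gt_index) pairs.
    """
    k, m = len(pred_masks), len(gt_masks)
    if m > k:
        raise ValueError("need at least as many queries as ground-truth masks")
    iou = [[mask_iou(p, g) for g in gt_masks] for p in pred_masks]
    best_score, best_pairs = -1.0, []
    for chosen in permutations(range(k), m):
        score = sum(iou[q][g] for g, q in enumerate(chosen))
        if score > best_score:
            best_score = score
            best_pairs = [(q, g) for g, q in enumerate(chosen)]
    return best_pairs


def multi_choice_loss(pred_masks, gt_masks):
    """Mean (1 - IoU) over matched pairs; unmatched queries are ignored."""
    pairs = match_queries_to_targets(pred_masks, gt_masks)
    if not pairs:
        return 0.0
    return sum(1.0 - mask_iou(pred_masks[q], gt_masks[g]) for q, g in pairs) / len(pairs)


# One click inside both a part (2x2) and the whole object (4x4).
gt_masks = [rect(0, 2, 0, 2), rect(0, 4, 0, 4)]
# Three hypothetical query outputs for that click, from coarse to fine.
pred_masks = [rect(0, 4, 0, 3), rect(0, 2, 0, 2), rect(0, 1, 0, 1)]

pairs = match_queries_to_targets(pred_masks, gt_masks)
loss = multi_choice_loss(pred_masks, gt_masks)
print(pairs, loss)
```

In this toy case the part mask is matched to the second query and the object mask to the first, so both granularities receive supervision from a single click while the third query is left unmatched.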
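
The data unification step can be pictured as a small registry that records, for each of the seven datasets, its granularity group and which of the two decoupled classifiers it can supervise. The grouping follows the list in the summary; the field names and the label flags are assumptions made for illustration (in particular, the flags for the part datasets assume that part annotations also name their parent object category).

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    OBJECT = "general-purpose segmentation"
    PART = "part segmentation"
    MULTI = "class-agnostic multi-granularity"


@dataclass(frozen=True)
class DatasetSpec:
    name: str
    granularity: Granularity
    object_labels: bool  # assumed: can supervise the object classifier
    part_labels: bool    # assumed: can supervise the part classifier


DATASETS = [
    DatasetSpec("MSCOCO", Granularity.OBJECT, True, False),
    DatasetSpec("Objects365", Granularity.OBJECT, True, False),
    DatasetSpec("ADE20k", Granularity.OBJECT, True, False),
    DatasetSpec("Pascal Part", Granularity.PART, True, True),
    DatasetSpec("PACO", Granularity.PART, True, True),
    DatasetSpec("PartImageNet", Granularity.PART, True, True),
    DatasetSpec("SA-1B", Granularity.MULTI, False, False),
]


def classification_heads(spec):
    """Names of the decoupled classifiers this dataset can train."""
    heads = []
    if spec.object_labels:
        heads.append("object")
    if spec.part_labels:
        heads.append("part")
    return tuple(heads)


def group_by_granularity(specs):
    """Map each granularity group to the dataset names it contains."""
    groups = {g: [] for g in Granularity}
    for spec in specs:
        groups[spec.granularity].append(spec.name)
    return groups


groups = group_by_granularity(DATASETS)
for spec in DATASETS:
    print(spec.name, spec.granularity.value, classification_heads(spec))
```

Under these assumptions, SA-1B contributes mask supervision only, while the labeled datasets additionally train whichever classifier matches their annotations.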