Effective SAM Combination for Open-Vocabulary Semantic Segmentation

Minhyeok Lee, Suhwan Cho, Jungho Lee, Sunghun Yang, Heeseung Choi, Ig-Jae Kim, Sangyoun Lee
2024-11-22
Abstract: Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problems that this paper attempts to solve are computational efficiency and segmentation accuracy in open-vocabulary semantic segmentation. Specifically:

1. **Limitations of traditional methods**:
   - Traditional two-stage approaches usually combine a powerful mask proposal generator (such as the Segment Anything Model, SAM) with a pre-trained vision-language model (such as CLIP). Although these methods are effective, they have the following problems:
     - **High computational cost**: The two-stage approach must generate mask proposals first and then classify them, which consumes substantial computational resources.
     - **Inefficient memory usage**: A large amount of intermediate data has to be processed, so memory is used inefficiently.
     - **Loss of background information**: Cropping image regions removes the background, which creates a domain gap with the pre-trained model.
     - **Low-resolution prediction**: Correlation-based methods can only make predictions on low-resolution, deeply embedded feature maps and cannot reconstruct detailed local information.
2. **The proposed new method**:
   - To solve the above problems, the authors propose a new single-stage open-vocabulary semantic segmentation model named ESC-Net (Effective SAM Combination Network). The main improvements of ESC-Net include:
     - **Efficient inference framework**: By incorporating SAM decoder blocks, ESC-Net performs class-agnostic segmentation while keeping inference efficient.
     - **Pseudo-prompt generation**: The correlation between CLIP image and text features is used to generate pseudo coordinate points and object masks, which are embedded through SAM's prompt encoder to guide the SAM transformer blocks.
     - **Vision-Language Fusion (VLF) module**: A VLF module generates the final per-category prediction masks under image and text guidance, further improving segmentation accuracy.
3. **Main contributions**:
   - Propose ESC-Net, a new single-stage open-vocabulary semantic segmentation model that combines CLIP and SAM, retaining SAM's strong segmentation ability while improving inference efficiency.
   - Introduce correlation-based pseudo-prompts that enhance image-text modeling and yield more accurate, dense prediction masks.
   - Achieve state-of-the-art performance on standard benchmark datasets such as ADE20K, PASCAL-VOC, and PASCAL-Context, and verify robustness through extensive ablation experiments.

In summary, this paper aims to solve the problems of low computational efficiency and insufficient segmentation accuracy in existing open-vocabulary semantic segmentation methods by introducing ESC-Net, thereby providing a more efficient and accurate solution.
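
The pseudo-prompt idea summarized above (turning image-text correlations into point and mask prompts) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `correlation_map` and `pseudo_prompts`, the fixed threshold, and the choice of the correlation peak as the pseudo point are all assumptions made for illustration, and the tiny hand-written vectors stand in for real CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def correlation_map(image_feats, text_feat):
    """image_feats is an H x W grid of D-dim vectors; returns an H x W grid of similarities."""
    return [[cosine(pixel, text_feat) for pixel in row] for row in image_feats]


def pseudo_prompts(corr, threshold=0.5):
    """Hypothetical prompt extraction for one class.

    Returns (mask, point): a binary mask of pixels whose correlation exceeds
    `threshold`, and the (row, col) of the strongest such pixel, or None if
    no pixel passes the threshold.
    """
    mask = [[1 if value > threshold else 0 for value in row] for row in corr]
    point, best = None, threshold
    for r, row in enumerate(corr):
        for c, value in enumerate(row):
            if value > best:
                best, point = value, (r, c)
    return mask, point


# Toy 2 x 2 feature grid with 2-dim features, standing in for CLIP outputs.
image_feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.9, 0.1]],
]
text_feat = [1.0, 0.0]  # stand-in for one class's text embedding

mask, point = pseudo_prompts(correlation_map(image_feats, text_feat))
print(mask)   # [[1, 0], [1, 1]]
print(point)  # (0, 0)
```

In a full pipeline, a mask and point like these would be passed per class to a promptable segmenter such as SAM; how ESC-Net actually selects points and encodes the prompts is described in the paper itself and is not reproduced here.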