Abstract: Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.

What problem does this paper attempt to address?

This paper addresses two problems in open-vocabulary semantic segmentation: computational efficiency and segmentation accuracy. Specifically:

1. **Limitations of traditional methods**:
   - Traditional two-stage approaches usually combine a powerful mask proposal generator (such as the Segment Anything Model, SAM) with a pre-trained vision-language model (such as CLIP). Although these methods are effective, they have the following problems:
     - **High computational cost**: The two-stage approach must generate mask proposals first and then classify them, which consumes substantial computational resources.
     - **Inefficient memory usage**: A large amount of intermediate data has to be processed, so memory is used inefficiently.
     - **Loss of background information**: Cropping image regions removes the background, which creates a domain gap with the pre-trained model.
     - **Low-resolution prediction**: Correlation-based methods can only make predictions on low-resolution, highly embedded feature layers and cannot reconstruct detailed local information.
2. **The proposed new method**:
   - To solve the above problems, the authors propose a new single-stage open-vocabulary semantic segmentation model named ESC-Net (Effective SAM Combination Network). The main improvements of ESC-Net include:
     - **Efficient inference framework**: By incorporating SAM decoder blocks, ESC-Net performs class-agnostic segmentation while keeping inference efficient.
     - **Pseudo-prompt generation**: The correlation between CLIP image and text features is used to generate pseudo coordinate points and object masks, which are embedded into SAM's prompt encoder to guide the SAM transformer blocks.
     - **Vision-Language Fusion (VLF) module**: A VLF module generates the final per-category prediction masks under image and text guidance, further improving segmentation accuracy.
3. **Main contributions**:
   - Propose ESC-Net, a new single-stage open-vocabulary semantic segmentation model that combines CLIP and SAM, retaining SAM's strong segmentation ability while improving inference efficiency.
   - Introduce correlation-based pseudo-prompts that enhance image-text modeling and yield more accurate, dense prediction masks.
   - Achieve state-of-the-art performance on standard benchmark datasets such as ADE20K, PASCAL-VOC, and PASCAL-Context, and verify robustness through extensive ablation experiments.

In summary, this paper introduces ESC-Net to address the low computational efficiency and insufficient segmentation accuracy of existing open-vocabulary semantic segmentation methods, providing a more efficient and accurate solution.
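The pseudo-prompt generation step mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the summary does not say how ESC-Net turns correlations into points and masks, so the peak-location point, the min-max thresholded mask, the `threshold` value, and the function names `correlation_map` and `pseudo_prompts` are all assumptions made for illustration. Random arrays stand in for CLIP image and text features.

```python
import numpy as np


def correlation_map(image_feats, text_feats):
    """Cosine similarity between each patch feature and each class embedding.

    image_feats: (H, W, D) patch features; text_feats: (C, D) class embeddings.
    Returns an array of shape (C, H, W).
    """
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return np.einsum("hwd,cd->chw", img, txt)


def pseudo_prompts(corr, threshold=0.5):
    """Derive one pseudo point and one pseudo mask per class (assumed scheme).

    The point is the (row, col) of the peak correlation. The mask keeps the
    locations whose min-max normalized score is at least `threshold`.
    """
    num_classes, height, width = corr.shape
    flat = corr.reshape(num_classes, -1)
    peak = flat.argmax(axis=1)
    points = np.stack([peak // width, peak % width], axis=1)  # (C, 2)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    norm = (flat - lo) / np.maximum(hi - lo, 1e-8)
    masks = (norm >= threshold).reshape(num_classes, height, width)
    return points, masks


# Stand-in features: a 4x4 grid of 8-dim patch features and 2 class embeddings.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 4, 8))
text_feats = rng.normal(size=(2, 8))
image_feats[1, 2] = text_feats[0]  # plant a patch that matches class 0 exactly

corr = correlation_map(image_feats, text_feats)
points, masks = pseudo_prompts(corr)
```

In a pipeline like the one the summary describes, outputs of this kind would be handed to SAM's prompt encoder as point and mask prompts. That hand-off is omitted here because the summary does not specify its interface.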

Effective SAM Combination for Open-Vocabulary Semantic Segmentation

PosSAM: Panoptic Open-vocabulary Segment Anything

Open-Vocabulary Semantic Segmentation with Image Embedding Balancing

Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

SAM Fails to Segment Anything? – SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

AM-SAM: Automated Prompting and Mask Calibration for Segment Anything Model

Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

Region-Based Online Selective Examination for Weakly Supervised Semantic Segmentation

OpenSD: Unified Open-Vocabulary Segmentation and Detection

SAN: Side Adapter Network for Open-vocabulary Semantic Segmentation

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Boosting Segment Anything Model Towards Open-Vocabulary Learning

RSAM-Seg: A SAM-based Approach with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

MobileSAMv2: Faster Segment Anything to Everything

MeSAM: Multiscale Enhanced Segment Anything Model for Optical Remote Sensing Images

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

Segment Anything without Supervision

Semantic-SAM: Segment and Recognize Anything at Any Granularity

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP