Abstract: Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at: https://github.com/HVision-NKU/Cascade-CLIP

What problem does this paper attempt to address?

The paper primarily focuses on addressing the problem of Zero-shot Semantic Segmentation, particularly on how to better utilize intermediate layer features when using pre-trained vision-language models (such as CLIP) to improve segmentation performance for new (unseen) categories.

Specifically, the paper identifies two key issues with existing CLIP-based methods:

1. **Only using the last layer features**: Most methods only use the last layer features of the CLIP model to align with text embeddings, ignoring the rich object detail information contained in the intermediate layers.
2. **Poor performance of directly fusing multi-layer features**: Although intermediate layer features can capture more local details, directly fusing these features with the last layer features leads to performance degradation due to the significant differences between features from different layers, which disrupts the original vision-language association in CLIP.

To address these issues, the authors propose the Cascade-CLIP framework, whose core ideas include:

- **Stage-wise processing of the visual encoder**: Dividing the visual encoder of CLIP into multiple stages and assigning independent text-image decoders to each stage to better establish the association between visual features and text embeddings.
- **Neighborhood Gaussian Aggregation (NGA) module**: To effectively integrate features from different Transformer blocks within the same stage, an aggregation module is introduced that can adaptively assign different weights based on relative distance.

Through the above methods, Cascade-CLIP can better utilize the intermediate layer feature information of the CLIP model, significantly improving segmentation performance for unseen categories, especially excelling in the recognition of local details such as boundaries.
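The two ideas above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the Gaussian is assumed to be centered on the last block of each stage with a fixed `sigma`, cosine similarity stands in for the learned text-image decoders, and per-stage scores are simply summed. All function names here are hypothetical.

```python
import math


def gaussian_weights(num_blocks, sigma=1.0):
    """Normalized Gaussian weights for the blocks of one stage.

    Assumption: distance is measured from the last block of the stage,
    so deeper blocks receive larger weights.
    """
    dists = [num_blocks - 1 - i for i in range(num_blocks)]
    raw = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in dists]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate_stage(block_feats, sigma=1.0):
    """Gaussian-weighted sum of the per-block feature vectors in one stage."""
    weights = gaussian_weights(len(block_feats), sigma)
    dim = len(block_feats[0])
    return [
        sum(w * feat[k] for w, feat in zip(weights, block_feats))
        for k in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cascade_scores(stages, text_embeds, sigma=1.0):
    """Score one patch against every class, stage by stage.

    `stages` is a list of stages; each stage is a list of per-block
    feature vectors for the same patch. Each stage is aggregated and
    scored independently, and the per-stage scores are summed
    (an assumed combination rule for this sketch).
    """
    scores = [0.0] * len(text_embeds)
    for block_feats in stages:
        stage_feat = aggregate_stage(block_feats, sigma)
        for c, text in enumerate(text_embeds):
            scores[c] += cosine(stage_feat, text)
    return scores
```

Scoring each stage separately, rather than fusing all block features into a single vector first, mirrors the motivation given above: features from distant layers differ too much to be aligned with the text embeddings through one shared path.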
Experimental results show that Cascade-CLIP achieves significantly better performance than existing methods on zero-shot segmentation tasks across multiple benchmark datasets (such as COCO-Stuff, Pascal-VOC, and Pascal-Context), particularly showing a notable improvement in the mean Intersection over Union (mIoU) metric for unseen categories.
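For readers unfamiliar with the metric, here is a minimal sketch of mIoU restricted to a chosen set of classes (for example, the unseen ones). It works on flat label sequences for a single image; benchmark tooling typically accumulates intersections and unions over the whole dataset before dividing, so treat this only as an illustration. The function name `miou` and its arguments are hypothetical.

```python
def miou(pred, target, class_ids):
    """Mean IoU over `class_ids` for two equal-length label sequences."""
    ious = []
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Example: six pixels, three classes; classes 1 and 2 play the role of
# "unseen" categories.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
unseen_miou = miou(pred, target, class_ids=[1, 2])
```

In this example class 1 has IoU 2/3 and class 2 has IoU 1/2, so `unseen_miou` is their average, 7/12.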

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation
