Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, Ming-Ming Cheng
2024-06-06
Abstract: Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at: https://github.com/HVision-NKU/Cascade-CLIP
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper primarily focuses on addressing the problem of Zero-shot Semantic Segmentation, particularly on how to better utilize intermediate layer features when using pre-trained vision-language models (such as CLIP) to improve segmentation performance for new (unseen) categories. Specifically, the paper identifies two key issues with existing CLIP-based methods:

1. **Only using the last layer features**: Most methods only use the last layer features of the CLIP model to align with text embeddings, ignoring the rich object detail information contained in the intermediate layers.
2. **Poor performance of directly fusing multi-layer features**: Although intermediate layer features can capture more local details, directly fusing these features with the last layer features leads to performance degradation due to the significant differences between features from different layers, which disrupts the original vision-language association in CLIP.

To address these issues, the authors propose the Cascade-CLIP framework, whose core ideas include:

- **Stage-wise processing of the visual encoder**: Dividing the visual encoder of CLIP into multiple stages and assigning independent text-image decoders to each stage to better establish the association between visual features and text embeddings.
- **Neighborhood Gaussian Aggregation (NGA) module**: To effectively integrate features from different Transformer blocks within the same stage, an aggregation module is introduced that can adaptively assign different weights based on relative distance.

Through the above methods, Cascade-CLIP can better utilize the intermediate layer feature information of the CLIP model, significantly improving segmentation performance for unseen categories, especially excelling in the recognition of local details such as boundaries.
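The two core ideas can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed `sigma` (the summary describes the weighting as adaptive, whereas here it is a plain argument), the choice of measuring distance from the last block of each stage, the dot product standing in for each stage's text-image decoder, and the summation of stage outputs are all assumptions made for illustration.

```python
import math


def gaussian_weights(num_blocks, sigma=1.0):
    """Normalized Gaussian weights over the blocks of one stage.

    Assumption: distance is measured from the last (deepest) block of the
    stage, so that block receives the largest weight.
    """
    distances = [num_blocks - 1 - i for i in range(num_blocks)]
    raw = [math.exp(-(d * d) / (2.0 * sigma ** 2)) for d in distances]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate_stage(block_features, sigma=1.0):
    """Weighted sum of the per-block feature vectors within one stage."""
    weights = gaussian_weights(len(block_features), sigma)
    dim = len(block_features[0])
    return [
        sum(w * feat[j] for w, feat in zip(weights, block_features))
        for j in range(dim)
    ]


def split_into_stages(block_features, num_stages):
    """Split the encoder blocks into equal, consecutive stages.

    Assumes len(block_features) is divisible by num_stages.
    """
    per_stage = len(block_features) // num_stages
    return [
        block_features[s * per_stage:(s + 1) * per_stage]
        for s in range(num_stages)
    ]


def cascade_scores(block_features, text_embeddings, num_stages, sigma=1.0):
    """Score each class per stage, then sum the stage scores.

    A dot product is a stand-in for an independent learned decoder.
    """
    scores = [0.0] * len(text_embeddings)
    for stage in split_into_stages(block_features, num_stages):
        stage_feature = aggregate_stage(stage, sigma)
        for c, text in enumerate(text_embeddings):
            scores[c] += sum(a * b for a, b in zip(stage_feature, text))
    return scores
```

Each stage is aggregated and matched against the text embeddings separately, so features from distant layers are never mixed before alignment, which is the failure mode the paper attributes to direct fusion.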
Experimental results show that Cascade-CLIP achieves significantly better performance than existing methods on zero-shot segmentation tasks across multiple benchmark datasets (such as COCO-Stuff, Pascal-VOC, and Pascal-Context), particularly showing a notable improvement in the mean Intersection over Union (mIoU) metric for unseen categories.
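For reference, the mIoU over a subset of classes can be computed as in the sketch below. It operates on flattened label lists for a single prediction; the function names and the toy labels are hypothetical, and benchmark code typically accumulates intersections and unions over the whole dataset rather than per image.

```python
def per_class_iou(pred, target, class_ids):
    """IoU per class over flattened label lists of equal length.

    Classes that appear in neither pred nor target are skipped.
    """
    ious = {}
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious[c] = inter / union
    return ious


def mean_iou(pred, target, class_ids):
    """Mean of the per-class IoUs, restricted to class_ids."""
    ious = per_class_iou(pred, target, class_ids)
    return sum(ious.values()) / len(ious) if ious else float("nan")


# Toy example: class 0 is treated as seen, classes 1 and 2 as unseen.
pred = [0, 0, 1, 2, 2, 2]
target = [0, 1, 1, 2, 2, 1]
unseen_miou = mean_iou(pred, target, class_ids=[1, 2])
seen_miou = mean_iou(pred, target, class_ids=[0])
```

Reporting the metric separately for seen and unseen class sets is what makes the zero-shot gain visible, since an average over all classes can hide weak performance on the unseen ones.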