Abstract: Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task, where only images without annotations are available. However, we observe that when CLIP is adapted to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works do not explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias. To avoid interference, the two kinds of bias are first encoded independently into the Reference feature and the positional feature. Via a matrix multiplication between the two features, a bias logit map is generated to explicitly represent the two kinds of bias. We then rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder that takes the CLIP features and the rectified logits as input and outputs a rectified segmentation mask with the help of the Gumbel-Softmax operation. To make the bias modeling and rectification process meaningful and effective, a contrastive loss based on masked visual features and the text features of different classes is imposed. To further improve the segmentation, we distill the knowledge from the rectified CLIP to an advanced segmentation architecture by minimizing our designed mask-guided, feature-guided and text-guided loss terms. Extensive experiments on various benchmarks demonstrate that ReCLIP++ performs favorably against previous state-of-the-art methods.
The implementation is available at https://github.com/dogehhh/ReCLIP.
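The rectification step described in the abstract (a matrix multiplication between the Reference feature and the positional feature, followed by an element-wise subtraction from the CLIP logits and a Gumbel-Softmax) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes (C classes, N patches, D feature dimensions), the temperature `tau`, and the random inputs are all assumptions made for illustration, and the mask decoder that the paper places before the Gumbel-Softmax is omitted here.

```python
import numpy as np


def bias_logit_map(reference_feat, positional_feat):
    """Combine the two bias encodings into one bias logit map.

    reference_feat:  (C, D) array, assumed to encode class-preference bias.
    positional_feat: (N, D) array, assumed to encode space-preference bias.
    Returns a (C, N) array: one bias logit per class and patch.
    """
    return reference_feat @ positional_feat.T


def rectify_logits(clip_logits, bias_logits):
    """Rectify CLIP logits by element-wise subtraction of the bias map."""
    return clip_logits - bias_logits


def gumbel_softmax(logits, tau=1.0, rng=None, axis=0):
    """Soft Gumbel-Softmax sample over the class axis (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    # Clip the uniform noise away from 0 and 1 so both logs stay finite.
    u = np.clip(rng.uniform(size=logits.shape), 1e-9, 1.0 - 1e-9)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=axis, keepdims=True)


# Toy example: 3 classes, 4 patches, 8-dimensional features (all assumed).
rng = np.random.default_rng(0)
reference_feat = rng.normal(size=(3, 8))
positional_feat = rng.normal(size=(4, 8))
clip_logits = rng.normal(size=(3, 4))

bias = bias_logit_map(reference_feat, positional_feat)
rectified = rectify_logits(clip_logits, bias)
soft_mask = gumbel_softmax(rectified, tau=0.5, rng=rng)
```

In this sketch each column of `soft_mask` is a distribution over classes for one patch; in the paper, the features would come from CLIP's encoders and the bias encodings would be learned rather than sampled at random.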

WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation

CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP

Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation

SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger

Exploit CAM by Itself: Complementary Learning System for Weakly Supervised Semantic Segmentation

ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

APC: Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation

CLIP for Lightweight Semantic Segmentation

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation