Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained, under the supervision of a contrastive loss, to capture global features that distinguish different text descriptions, which makes it highly effective for single-label classification. However, it performs poorly on multi-label datasets because the global feature tends to be dominated by the most prominent class, and the contrastive nature of the softmax operation aggravates this. In this study, we observe that multi-label classification relies heavily on discriminative local features, which CLIP overlooks. We therefore dissect how patch-wise spatial information is preserved in CLIP and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. This framework is based solely on a frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. In addition, to comprehensively assess the quality and practicality of the generated tags, we apply them to a downstream task, namely weakly supervised semantic segmentation (WSSS), with the generated tags serving as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validate the effectiveness of the generated tags. Our code is available at https://github.com/linyq2117/TagCLIP.
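The contrast the abstract draws between a single global feature and patch-level classification (step 1) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the temperature of 0.01, the max-pooling over patches, and the use of the mean patch feature as a stand-in for the class token are all assumptions made for illustration. The DMAR and CWR modules are not sketched because the abstract gives no details about them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def coarse_patch_scores(patch_feats, text_feats, temperature=0.01):
    """Per-patch class probabilities.

    patch_feats: (N, D) patch embeddings, text_feats: (C, D) class text embeddings.
    Returns an (N, C) array; each row is a softmax over classes.
    The temperature value is an assumed placeholder.
    """
    sims = l2_normalize(patch_feats) @ l2_normalize(text_feats).T
    return softmax(sims / temperature, axis=-1)

def image_level_scores(patch_probs):
    # Assumed aggregation: a class is scored by its most confident patch.
    return patch_probs.max(axis=0)

def global_scores(global_feat, text_feats, temperature=0.01):
    # Single-label style scoring from one global feature vector.
    sims = l2_normalize(global_feat[None, :]) @ l2_normalize(text_feats).T
    return softmax(sims / temperature, axis=-1)[0]

# Toy data: three classes with orthogonal text embeddings; six patches show
# class 0 (the prominent object) and one patch shows class 1.
text_feats = np.eye(3)
patch_feats = np.vstack([np.tile([1.0, 0.0, 0.0], (6, 1)), [[0.0, 1.0, 0.0]]])

# Assumption: the mean patch feature stands in for the class token.
global_pred = global_scores(patch_feats.mean(axis=0), text_feats)
local_pred = image_level_scores(coarse_patch_scores(patch_feats, text_feats))

print("global:", np.round(global_pred, 3))
print("local: ", np.round(local_pred, 3))
```

In this toy setting the global score for the minor class collapses toward zero because the softmax lets the prominent class absorb nearly all of the probability mass, while the patch-level scores keep both present classes high and the absent class low.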

TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Extract Free Dense Labels from CLIP

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation

CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

[CLS] Token is All You Need for Zero-Shot Semantic Segmentation

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

Transductive Zero-Shot and Few-Shot CLIP