Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture the global features to distinguish different text descriptions supervised by contrastive loss, making it highly effective for single-label classification. However, it shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class and the contrastive nature of the softmax operation aggravates this. In this study, we observe that multi-label classification results rely heavily on discriminative local features, which are overlooked by CLIP. We therefore dissect the preservation of patch-wise spatial information in CLIP and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. This framework is based solely on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. In addition, to comprehensively assess the quality and practicality of the generated tags, we extend their application to a downstream task, namely weakly supervised semantic segmentation (WSSS) with the generated tags as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validates the effectiveness of the generated tags. Our code is available at https://github.com/linyq2117/TagCLIP.
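The softmax competition described in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper: the class names, the logit values, the 0.5 threshold, and the max-over-patches aggregation are all assumptions chosen only to show how one dominant class can suppress another in a global score while patch-level scores still expose both.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity logits for one image with a large dog and a
# small cat; "car" is absent. These numbers are made up for illustration.
classes = ["dog", "cat", "car"]
global_logits = [6.0, 4.0, 1.0]

# Global view: softmax forces the classes to compete, so the prominent
# class takes almost all of the probability mass.
global_probs = softmax(global_logits)

# Local view: each (hypothetical) patch is classified on its own.
patch_logits = [
    [6.0, 1.0, 1.0],  # patch on the dog
    [6.0, 1.0, 1.0],  # another patch on the dog
    [1.0, 6.0, 1.0],  # patch on the cat
    [2.0, 2.0, 2.0],  # background patch, no clear class
]
patch_probs = [softmax(p) for p in patch_logits]

# Assumed aggregation: a class score is its best score over all patches.
local_scores = [max(p[c] for p in patch_probs) for c in range(len(classes))]

threshold = 0.5  # assumed decision threshold
global_tags = [c for c, p in zip(classes, global_probs) if p > threshold]
local_tags = [c for c, s in zip(classes, local_scores) if s > threshold]

print("global tags:", global_tags)
print("local tags: ", local_tags)
```

With these made-up logits the global score keeps only the dominant class, while the patch-level scores recover both classes that are present.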

What problem does this paper attempt to address?

The paper aims to improve CLIP's performance on multi-label classification without any additional training. Although the original CLIP performs well on single-label classification, it performs poorly on multi-label tasks because its global feature is often dominated by the most prominent class, so the local features in the image are ignored. The softmax operation in CLIP's contrastive loss exacerbates the problem, because it makes classes compete with one another even when several of them appear in the same image. The paper therefore proposes a local-to-global framework (TagCLIP) that leverages the local features in CLIP to improve multi-label classification.

### Main Contributions

1. **Explore Spatial Information in CLIP**: The paper finds that the attention operation in the last layer of CLIP destroys spatial information. Based on this finding, it proposes a local-to-global framework (TagCLIP) that enhances CLIP's multi-label classification performance without any additional training.
2. **Experimental Verification**: The experiments show that TagCLIP significantly improves CLIP's multi-label classification performance and generates high-quality image tags. It outperforms the original CLIP and other methods on multiple benchmark datasets.
3. **Downstream Task Applications**: The paper applies the generated image tags to weakly supervised semantic segmentation (WSSS) and finds that this classify-then-segment paradigm significantly outperforms other annotation-free segmentation methods, which validates the effectiveness of the generated tags.

### Method Overview

1. **Coarse Classification**: Skip the self-attention operation in the last layer of CLIP and perform patch-level classification on the feature map of the second-to-last layer to obtain coarse classification scores for each class.
2. **Dual-Masking Attention Refinement (DMAR)**: Apply a dual-masking strategy that keeps only high-confidence attention weights and high-confidence coarse scores, then use them to refine the classification scores and reduce noise.
3. **Class-Wise Reidentification (CWR)**: Correct the refined scores from a global perspective, combining local and global information to improve classification accuracy.

### Experimental Results

- **Multi-Label Classification**: On the PASCAL VOC 2007 and MS COCO 2014 datasets, TagCLIP improves performance by 7.0% and 5.5% respectively, significantly outperforming other methods that require no additional training.
- **Annotation-Free Semantic Segmentation**: When the generated image tags are used as pseudo labels for weakly supervised semantic segmentation, the classify-then-segment paradigm significantly outperforms other annotation-free segmentation methods on multiple datasets.

In conclusion, the paper addresses CLIP's shortcomings in multi-label classification with a local-to-global framework and demonstrates the practical value of the generated image tags in downstream tasks.
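The three steps in the method overview can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`coarse_scores`, `refine_scores`, `reidentify`), the thresholds, the similarity scale, the toy 2-D features, the attention matrix, and the hypothetical global scores are all invented for the example. In particular, the weighted fusion in `reidentify` is a simple stand-in for the global re-scoring step, whose details the summary does not give.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def coarse_scores(patch_feats, text_embeds, scale=10.0):
    """Step 1 (assumed form): per-patch softmax over class similarities."""
    return [softmax([scale * cosine(p, t) for t in text_embeds]) for p in patch_feats]

def refine_scores(scores, attn, attn_thresh=0.2, score_thresh=0.3):
    """Step 2 (DMAR-like sketch): propagate scores between patches, using
    one mask on attention weights and one mask on coarse scores."""
    n, k = len(scores), len(scores[0])
    refined = []
    for i in range(n):
        # Mask 1: drop low-confidence attention links.
        w = [a if a >= attn_thresh else 0.0 for a in attn[i]]
        total = sum(w)
        if total == 0.0:
            refined.append(list(scores[i]))
            continue
        row = []
        for c in range(k):
            # Mask 2: drop low-confidence coarse scores before averaging.
            kept = [scores[j][c] if scores[j][c] >= score_thresh else 0.0 for j in range(n)]
            row.append(sum(wj * s for wj, s in zip(w, kept)) / total)
        refined.append(row)
    return refined

def reidentify(local_scores, global_scores, weight=0.5):
    """Step 3 (CWR-like stand-in): fuse local and global class scores."""
    return [weight * l + (1.0 - weight) * g for l, g in zip(local_scores, global_scores)]

# Toy inputs: three patches, two classes, 2-D features (all hypothetical).
classes = ["dog", "cat"]
patch_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
attn = [[0.6, 0.3, 0.1], [0.3, 0.6, 0.1], [0.1, 0.1, 0.8]]
global_scores = [0.9, 0.4]  # hypothetical global-view scores per class

coarse = coarse_scores(patch_feats, text_embeds)
refined = refine_scores(coarse, attn)
local = [max(row[c] for row in refined) for c in range(len(classes))]
final = reidentify(local, global_scores)

tags = [c for c, s in zip(classes, final) if s > 0.5]
global_only = [c for c, s in zip(classes, global_scores) if s > 0.5]
print("global only:", global_only)
print("local-to-global:", tags)
```

In this toy setup the global scores alone would tag only the dominant class, while fusing them with the refined patch-level scores keeps both classes. In a real pipeline the patch features, text embeddings, and attention weights would come from a frozen CLIP model rather than hand-written lists.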

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Decoupling Classification and Localization of CLIP

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Contrastive Localized Language-Image Pre-Training

DiffCLIP: Few-shot Language-driven Multimodal Classifier

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation

CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels