CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy

2024-01-25

Abstract: Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses vision-language alignment in open-vocabulary dense prediction tasks, particularly for CLIP models based on Vision Transformers (ViTs). Specifically:

1. **Background**:
   - Open-vocabulary dense prediction tasks (such as object detection and image segmentation) require models to recognize visual concepts that were not seen in the training data.
   - Contrastive Language-Image Pre-training (CLIP) models, especially the variants built on Vision Transformers (ViTs), generalize well in zero-shot image classification.
   - However, when the vision-language alignment of CLIP is transferred from global image representations to local region representations, CLIP ViTs suffer from a domain shift from full images to local image regions.

2. **Problem**:
   - The image-level representation of a CLIP ViT does not carry over directly to local regions, which degrades performance on open-vocabulary dense prediction tasks.
   - In particular, region representations extracted from the ViT's dense feature map are poorly aligned with language.

3. **Solution**:
   - The paper proposes CLIPSelf, which adapts the image-level recognition ability of a CLIP ViT to local image regions through self-distillation.
   - CLIPSelf needs no region-text pairs. Instead, it aligns a region representation extracted from the ViT's dense feature map with the image-level representation of the corresponding image crop.

With the enhanced CLIP ViTs, the paper reports new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
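The self-distillation objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mean_pool_region`, `self_distillation_loss`) are hypothetical, region features are obtained by simple mean pooling over grid cells, and the alignment is written as one minus cosine similarity. The summary only states that the two representations are "aligned", so both the pooling operator and the loss form are assumptions made for illustration.

```python
import math


def mean_pool_region(feature_map, box):
    """Average the feature vectors of all grid cells inside `box`.

    `feature_map` is a grid (list of rows) of feature vectors.
    `box` is (row0, col0, row1, col1) with exclusive end indices.
    Mean pooling is an assumed stand-in for the region feature extractor.
    """
    r0, c0, r1, c1 = box
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(v[d] for v in cells) / len(cells) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def self_distillation_loss(student_feature_map, boxes, teacher_crop_embeddings):
    """Mean of (1 - cosine similarity) over all regions.

    Student: region features pooled from the dense feature map.
    Teacher: image-level embeddings of the matching image crops,
    assumed here to be precomputed by the frozen original model.
    """
    losses = []
    for box, teacher in zip(boxes, teacher_crop_embeddings):
        student = mean_pool_region(student_feature_map, box)
        losses.append(1.0 - cosine_similarity(student, teacher))
    return sum(losses) / len(losses)


# Toy example: a 2x2 grid of 2-dimensional features and two regions.
feature_map = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
boxes = [(0, 0, 1, 2), (1, 0, 2, 2)]      # top row, bottom row
teacher = [[2.0, 0.0], [1.0, 0.0]]        # made-up crop embeddings
loss = self_distillation_loss(feature_map, boxes, teacher)
print(loss)
```

In the toy example the top-row region already points in the same direction as its teacher embedding and contributes zero loss, while the bottom-row region is orthogonal to its teacher and contributes a loss of one. In a real training setup the student feature map would come from the ViT being fine-tuned and the loss would be minimized by gradient descent.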

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

How Much Can CLIP Benefit Vision-and-Language Tasks?

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights

CLIPVQA: Video Quality Assessment via CLIP