Abstract: Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions on the visualization results. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at its architecture and features. Based on thorough analyses, we find the raw self-attentions link to inconsistent semantic regions, resulting in the opposite visualization. Besides, the noisy activations are owing to redundant features among categories. Building on these insights, we propose the CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features, without further fine-tuning as classical CAM methods. This approach significantly improves the explainability of CLIP, surpassing existing methods by large margins. Besides, it enables multimodal visualization and extends the capacity of raw CLIP on open-vocabulary tasks without extra alignment. The code is available at https://github.com/xmed-lab/CLIP_Surgery.

What problem does this paper attempt to address?

This paper addresses problems with the explainability of the CLIP (Contrastive Language-Image Pre-training) model. Specifically, the authors find two main problems when CLIP is used to generate explanation maps:

1. **Background over foreground**: CLIP tends to focus on background regions rather than the foreground, which is contrary to human perception.
2. **Noisy activations**: Many activations appear at irrelevant positions, and this noise undermines the credibility and interpretability of the model.

These problems make CLIP perform poorly in tasks that require high interpretability, such as semantic segmentation, image retrieval, and generation. To address them, the authors analyze CLIP's architecture and features in depth and propose "CLIP Surgery", which improves the interpretability of CLIP through surgery-like modifications.

### Main contributions

1. **Problem discovery**: The authors observe opposite visualization and noisy activations in CLIP and trace them to inconsistent self-attention and to redundant features among classes, respectively.
2. **Proposed solution**: Based on these findings, the authors propose "CLIP Surgery", consisting of architecture surgery and feature surgery, which significantly improves the interpretability of CLIP without further fine-tuning.
3. **Wide applicability**: The method performs well on multiple datasets and with different backbone networks, and it also applies to multimodal visualization and open-vocabulary tasks.

### Method overview

1. **Architecture surgery**:
   - **Consistent self-attention**: The self-attention is constructed from homogeneous parameters, which resolves the problem of CLIP's self-attention layers linking inconsistent semantic regions.
   - **Two-path structure**: Some feed-forward network (FFN) modules are removed, which avoids their negative impact on the final prediction and thereby improves the model's interpretability.
2. **Feature surgery**:
   - **Identifying redundant features**: The features are averaged along the class dimension to identify redundant features, which are then removed from the final similarity map to reduce noisy activations.

### Experimental results

- **Interpretability tasks**: On multiple datasets, including PASCAL VOC 2012, MS COCO 2017, PASCAL Context, and ImageNet-Segmentation-50, CLIP Surgery significantly improves the mIoU and mSC metrics.
- **Multi-label recognition tasks**: Feature surgery improves the performance of CLIP on multi-label recognition.
- **Multimodal visualization**: CLIP Surgery produces high-quality multimodal visualizations that help explain the learning process of CLIP.

In conclusion, through an in-depth analysis of CLIP's architecture and features, the paper proposes an effective method that significantly improves the interpretability of CLIP across various tasks.
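The two surgery steps in the method overview can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the single-head formulation, and the reading of "homogeneous parameters" as reusing the value projection on both sides of the attention score are assumptions made for illustration, and the feature-surgery function uses a plain (unweighted) mean over classes.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def consistent_self_attention(tokens, w_v):
    """Single-head attention whose scores come from one projection only.

    tokens: (num_tokens, dim) patch features.
    w_v:    (dim, dim) value projection (hypothetical parameter name).
    Assumption: the same projected features serve as query, key and value.
    """
    v = tokens @ w_v
    scale = v.shape[-1] ** -0.5
    attn = softmax((v @ v.T) * scale, axis=-1)  # (num_tokens, num_tokens)
    return attn @ v, attn


def feature_surgery(image_feats, text_feats):
    """Similarity map with the class-averaged (redundant) feature removed.

    image_feats: (num_tokens, dim), text_feats: (num_classes, dim),
    both assumed to be L2-normalised.
    """
    # Element-wise products: one feature vector per (token, class) pair.
    feats = image_feats[:, None, :] * text_feats[None, :, :]
    # The mean over the class dimension is shared by all classes,
    # so it is treated as the redundant part.
    redundant = feats.mean(axis=1, keepdims=True)
    # Summing over channels turns the products back into similarities.
    return (feats - redundant).sum(axis=-1)  # (num_tokens, num_classes)
```

With this unweighted mean, the result equals the raw cosine-similarity map minus its per-token average over classes, so an activation that is equally strong for every class is cancelled out, while class-specific responses remain.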

A Closer Look at the Explainability of Contrastive Language-Image Pre-training

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks
