Interpreting CLIP's Image Representation via Text-Based Decomposition

Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt

2024-03-29

Abstract: We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
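The additive decomposition described in the abstract can be illustrated with a toy sketch. This is not the authors' code: the array sizes and the random `contributions` tensor are hypothetical stand-ins for per-layer, per-head, per-patch terms, and any terms of the real model that are not attributable to attention heads are omitted. It only shows that, when a representation is a sum of such terms, its dot product with a text embedding splits into per-head and per-patch scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: layers, heads per layer, image tokens, embedding dim.
n_layers, n_heads, n_tokens, dim = 4, 3, 5, 8

# contributions[l, h, i] stands in for the output of head h in layer l
# attributed to image token i (random toy data, not real CLIP activations).
contributions = rng.normal(size=(n_layers, n_heads, n_tokens, dim))

# If the representation is a sum of these terms, summing them recovers it.
image_rep = contributions.sum(axis=(0, 1, 2))

# Stand-in for a text embedding in the same space.
text_rep = rng.normal(size=dim)

# The dot product is linear, so the image-text score splits the same way.
per_head_score = contributions.sum(axis=2) @ text_rep        # shape (n_layers, n_heads)
per_token_score = contributions.sum(axis=(0, 1)) @ text_rep  # shape (n_tokens,)

total_score = image_rep @ text_rep
```

Summing `per_head_score` or `per_token_score` gives back `total_score`, which is what makes it possible to attribute a text-aligned score to a single head or a single image patch.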

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper asks how CLIP's image representation can be interpreted by analyzing how individual components of the CLIP image encoder affect the final representation. The authors decompose the image representation into a sum over individual image patches, model layers, and attention heads, and use CLIP's text representations to interpret the summands. With this approach they characterize the role of each attention head, uncover emergent spatial localization in CLIP, and use these findings to remove spurious features from CLIP and to build a strong zero-shot image segmenter.

The main contributions of the paper are:

1. **Interpreting Attention Heads**: The authors propose an algorithm (TEXTSPAN) that automatically finds text representations spanning the output space of each attention head, which reveals property-specific roles for many heads (e.g., location or shape).
2. **Discovering Spatial Localization in Images**: By interpreting the image patches, the authors uncover an emergent spatial localization within CLIP.
3. **Applications and Improvements**: Using this understanding, the authors remove spurious features from CLIP and create a strong zero-shot image segmenter, indicating that a scalable understanding of transformer models can be used to repair and improve them.

Overall, the paper provides a method for interpreting and improving CLIP's image representation through an analysis of the model's internal structure.
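The idea of finding text representations that span a head's output space can be sketched as a greedy selection. The code below is a simplified illustration under stated assumptions, not the authors' TEXTSPAN implementation: the function name `greedy_text_basis`, the unit-normalization of candidates, and the toy data are all hypothetical. At each step it picks the text direction along which the head's outputs vary most, then projects that direction out of both the head outputs and the remaining texts.

```python
import numpy as np

def greedy_text_basis(head_outputs, text_embeddings, k):
    """Greedily pick k text indices whose directions explain head-output variance.

    head_outputs: (n_images, dim) outputs of one head (hypothetical input).
    text_embeddings: (n_texts, dim) candidate text directions.
    """
    # Center the outputs so squared projections measure variance.
    residual = head_outputs - head_outputs.mean(axis=0, keepdims=True)
    texts = text_embeddings.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(texts, axis=1)
        unit = texts / np.where(norms > 1e-8, norms, 1.0)[:, None]
        # Variance of the residual head outputs along each candidate direction.
        scores = ((residual @ unit.T) ** 2).sum(axis=0)
        scores[norms <= 1e-8] = -np.inf   # skip directions already projected out
        scores[chosen] = -np.inf          # never pick the same text twice
        best = int(np.argmax(scores))
        chosen.append(best)
        direction = unit[best]
        # Remove the chosen direction from the outputs and from the texts.
        residual = residual - np.outer(residual @ direction, direction)
        texts = texts - np.outer(texts @ direction, direction)
    return chosen

# Toy usage: outputs vary strongly along text 2 and weakly along text 0.
rng = np.random.default_rng(0)
toy_texts = np.eye(4, 6)
toy_outputs = (3.0 * rng.normal(size=(50, 1)) * toy_texts[2]
               + 1.0 * rng.normal(size=(50, 1)) * toy_texts[0])
chosen = greedy_text_basis(toy_outputs, toy_texts, k=2)
print(chosen)
```

In this toy setup the selected indices identify which candidate descriptions account for most of the head's variation, which is the sense in which a small set of texts can label a head's role.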

Quantifying and Enabling the Interpretability of CLIP-like Models

Interpreting the Second-Order Effects of Neurons in CLIP

Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Disentangling visual and written concepts in CLIP

Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances

Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Unveiling Glitches: A Deep Dive into Image Encoding Bugs within CLIP

Simplifying CLIP: Unleashing the Power of Large-Scale Models on Consumer-level Computers

A Joint Encoding Model for Image-Text Matching Based on CLIP

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Interpreting and Controlling Vision Foundation Models via Text Explanations

Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification

Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels