Interpreting CLIP's Image Representation via Text-Based Decomposition

Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt
2024-03-29
Abstract: We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how CLIP's image representation can be interpreted by analyzing how individual components of the CLIP image encoder affect the final representation. The authors decompose the image representation into a sum over individual image patches, model layers, and attention heads, and use CLIP's text representations to interpret the summands. With this approach they aim to characterize the role of each attention head, to uncover spatial localization within CLIP, and to apply that understanding to remove spurious features from CLIP and to build a strong zero-shot image segmenter.

The main contributions of the paper are:

1. **Interpreting Attention Heads**: The authors propose an algorithm (TEXTSPAN) that automatically finds text representations spanning the output space of each attention head, revealing property-specific roles for many heads (e.g., location or shape).
2. **Discovering Spatial Localization in Images**: By interpreting image patches, the authors uncover an emergent spatial localization within CLIP.
3. **Applications and Improvements**: Using these findings, the authors remove spurious features from CLIP and create a strong zero-shot image segmenter, indicating that a scalable understanding of transformer models is attainable and can be used to repair and improve models.

Overall, the paper offers a method for interpreting and improving CLIP's image representation through an analysis of the model's internal structure.
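The idea behind the first contribution, finding text representations that span a head's output space, can be illustrated with a small greedy procedure. The following is a simplified sketch, not the authors' implementation: the function name `text_span`, the toy data, and the variance-based selection with deflation are all assumptions made for illustration. Real inputs would be per-image outputs of one attention head and CLIP text embeddings in the same joint space.

```python
import numpy as np


def text_span(head_outputs, text_embeds, num_directions):
    """Greedily pick text directions that explain a head's output variance.

    head_outputs: (K, d) array, one row per image (hypothetical head outputs).
    text_embeds:  (M, d) array, one row per candidate text description.
    Returns the indices of the selected texts, in selection order.
    """
    # Center the head outputs so projections measure variance around the mean.
    C = head_outputs - head_outputs.mean(axis=0, keepdims=True)
    R = np.array(text_embeds, dtype=float)
    available = np.ones(R.shape[0], dtype=bool)
    chosen = []
    for _ in range(num_directions):
        # Variance of the head outputs along each remaining text direction.
        variances = (C @ R.T).var(axis=0)
        variances[~available] = -np.inf
        j = int(np.argmax(variances))
        norm = np.linalg.norm(R[j])
        if norm < 1e-12:
            break  # nothing left to explain with the remaining texts
        chosen.append(j)
        available[j] = False
        # Deflate: remove the chosen direction from outputs and texts.
        u = R[j] / norm
        C = C - np.outer(C @ u, u)
        R = R - np.outer(R @ u, u)
    return chosen


# Toy data: outputs vary mostly along axis 0, then axis 1.
rng = np.random.default_rng(0)
outputs = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 0.1, 0.1])
texts = np.eye(4)[:3]  # three hypothetical text directions
labels = ["location", "shape", "color"]  # illustrative labels only
picked = text_span(outputs, texts, num_directions=2)
print([labels[i] for i in picked])
```

In this toy setup the selected texts are the ones aligned with the directions of highest variance in the head's outputs. Removing each chosen direction before the next pick keeps later selections from repeating what earlier ones already explain.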