Understanding the Vulnerability of CLIP to Image Compression

Cangxiong Chen, Vinay P. Namboodiri, Julian Padget

2023-11-23

Abstract: CLIP is a widely used foundational vision-language model that is used for zero-shot image recognition and other image-text alignment tasks. We demonstrate that CLIP is vulnerable to change in image quality under compression. This surprising result is further analysed using an attribution method, Integrated Gradients. Using this attribution method, we are able to better understand both quantitatively and qualitatively exactly the nature in which the compression affects the zero-shot recognition accuracy of this model. We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the basis to understand this vulnerability of CLIP and can help us develop more effective methods to improve the robustness of CLIP and other vision-language models.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

The paper examines how vulnerable CLIP (Contrastive Language-Image Pretraining) is to image compression in zero-shot image recognition. Although CLIP is robust to distributional shifts across various datasets, the study finds that it is highly sensitive to changes in image quality: the text label it predicts for an image can change once that image has been compressed.

To explain this, the authors apply an attribution method, Integrated Gradients, to quantify and visualize how changes in image quality affect the model's predictions. Experiments on CIFAR-10 and STL-10 show that CLIP's recognition accuracy declines as image quality changes, and that Integrated Gradients can quantify the pixel-level factors behind the changed predictions. Integrated Gradients also satisfies the sensitivity and implementation-invariance axioms, which makes it well suited to this kind of analysis.

The main contributions of the paper are:

1. Demonstrating that CLIP is sensitive to image quality when performing zero-shot image recognition.
2. Using Integrated Gradients to investigate how quality changes affect predictions, providing both numerical estimates and visual explanations.

Future work includes developing strategies to improve the robustness of CLIP and other foundation models. The analysis gives a clearer picture of how CLIP behaves under image compression, which should help in improving the model's stability.
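To make the attribution step concrete, here is a minimal sketch of Integrated Gradients on a toy model. It is not the authors' code: the three-feature logistic scorer (`toy_score`, `toy_grad`, weights `W`) is a hypothetical stand-in for CLIP's image-text score, and `quantize` is a crude stand-in for lossy compression rather than a real JPEG codec. The sketch approximates the path integral from an all-zero baseline with a midpoint Riemann sum and checks the completeness property, namely that the attributions sum to the difference in score between the input and the baseline.

```python
import math

# Hypothetical 3-feature model standing in for CLIP's image-text score.
W = [2.0, -1.0, 0.5]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def toy_score(x):
    """Scalar model output for input x (assumed stand-in, not CLIP)."""
    return sigmoid(sum(w * v for w, v in zip(W, x)))


def toy_grad(x):
    """Analytic gradient of toy_score with respect to each input feature."""
    s = toy_score(x)
    return [s * (1.0 - s) * w for w in W]


def integrated_gradients(grad_fn, x, baseline, steps=100):
    """Midpoint Riemann approximation of Integrated Gradients.

    Averages the gradient along the straight line from baseline to x,
    then scales each component by (x_i - baseline_i).
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        grad = grad_fn(point)
        total = [t + g for t, g in zip(total, grad)]
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, total)]


def quantize(x, step):
    """Coarse rounding to a grid; an assumed proxy for compression loss."""
    return [math.floor(v / step + 0.5) * step for v in x]


x = [1.0, 0.5, -1.0]
baseline = [0.0, 0.0, 0.0]
x_coarse = quantize(x, 1.0)

attr = integrated_gradients(toy_grad, x, baseline)
attr_coarse = integrated_gradients(toy_grad, x_coarse, baseline)

# Completeness: attributions should sum to score(x) - score(baseline).
print("sum of attributions:", sum(attr))
print("score difference:   ", toy_score(x) - toy_score(baseline))
# Per-feature shift in attribution caused by the coarse quantization.
print("attribution shift:  ", [c - a for a, c in zip(attr, attr_coarse)])
```

Comparing `attr` with `attr_coarse` mirrors, in miniature, the idea of comparing attributions for an original image and its compressed version. In a real experiment the gradient would come from automatic differentiation through the model, not from a hand-written formula.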

Unveiling Glitches: A Deep Dive into Image Encoding Bugs within CLIP

Can Image Compression Rely on CLIP?

Toward a Holistic Evaluation of Robustness in CLIP Models

Fooling Contrastive Language-Image Pre-trained Models with CLIPMasterPrints

Delving into the Openness of CLIP

Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation

Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts

A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)

Interpreting CLIP's Image Representation via Text-Based Decomposition

Benchmarking PathCLIP for Pathology Image Analysis

WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP

Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Improving CLIP Robustness with Knowledge Distillation and Self-Training

TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation

On Erroneous Agreements of CLIP Image Embeddings

A study of the effect of JPG compression on adversarial images