Understanding the Vulnerability of CLIP to Image Compression

Cangxiong Chen, Vinay P. Namboodiri, Julian Padget
2023-11-23
Abstract: CLIP is a widely used foundational vision-language model that is used for zero-shot image recognition and other image-text alignment tasks. We demonstrate that CLIP is vulnerable to change in image quality under compression. This surprising result is further analysed using an attribution method, Integrated Gradients. Using this attribution method, we are able to better understand, both quantitatively and qualitatively, how compression affects the zero-shot recognition accuracy of this model. We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the basis to understand this vulnerability of CLIP and can help us develop more effective methods to improve the robustness of CLIP and other vision-language models.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper examines how vulnerable CLIP (Contrastive Language-Image Pretraining) is to image compression in zero-shot image recognition. Although CLIP is robust to distributional shifts on various datasets, the study finds that it is highly sensitive to changes in image quality: the predicted text labels change significantly after compression.

To explain this behaviour, the paper applies an attribution method, Integrated Gradients, to quantify and visualize how changes in image quality affect the model's predictions. Experiments on the CIFAR-10 and STL-10 datasets show that CLIP's recognition accuracy declines as image quality varies. The authors report that Integrated Gradients can effectively quantify the pixel-level factors that influence CLIP's predictions. Integrated Gradients also satisfies the sensitivity and implementation-invariance properties, which makes it well suited to this analysis.

The main contributions of the paper are:
1. Demonstrating that CLIP is sensitive to image quality when performing zero-shot image recognition.
2. Using Integrated Gradients to investigate the impact of quality changes on predictions, providing both numerical estimates and visual explanations.

Future work includes developing strategies to improve the robustness of CLIP and other foundation models. The study gives a clearer picture of how CLIP behaves under image compression, which can help in improving the model's stability.
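To make the attribution idea concrete, here is a minimal sketch of Integrated Gradients on a toy problem. It is not the authors' code: the function `model` is a hypothetical three-input stand-in for a CLIP class score, the all-zeros baseline is an assumption, and gradients are taken by finite differences rather than backpropagation. Integrated Gradients attributes to input i the quantity (x_i - b_i) times the average of the partial derivative of the score along the straight path from the baseline b to the input x; the sketch approximates that average with a midpoint Riemann sum.

```python
import math


def model(x):
    """Hypothetical stand-in for a class score: a logistic over a weighted sum."""
    weights = [0.5, -1.0, 2.0]
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of f at x (used here instead of autograd)."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    grad_sum = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        for i, g in enumerate(grad):
            grad_sum[i] += g
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, grad_sum)]


x = [1.0, 0.5, 0.25]          # toy "image" with three "pixels"
baseline = [0.0, 0.0, 0.0]    # assumed baseline; other choices are possible
attributions = integrated_gradients(model, x, baseline)

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(sum(attributions) - (model(x) - model(baseline)))
print(attributions, completeness_gap)
```

The completeness check at the end is a useful sanity test in practice: if the attributions do not sum to approximately the change in score between baseline and input, the number of integration steps is probably too small. For a real CLIP model one would replace `numerical_gradient` with automatic differentiation and apply the same procedure to an image and, for example, its compressed counterpart.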