Abstract: Contrastive Language-Image Pre-training (CLIP) models have shown significant potential, particularly in zero-shot classification across diverse distribution shifts. Building on existing evaluations of overall classification robustness, this work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives. First, we investigate the robustness of CLIP models to variations in specific visual factors. Second, we assess two critical safety objectives (confidence uncertainty and out-of-distribution detection) beyond mere classification accuracy. Third, we evaluate the finesse with which CLIP models bridge the image and text modalities. Fourth, we extend our examination to 3D awareness in CLIP models, moving beyond traditional 2D image understanding. Finally, we explore the interaction between vision and language encoders within modern large multimodal models (LMMs) that utilize CLIP as the visual backbone, focusing on how this interaction impacts classification robustness. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Our study uncovers several previously unknown insights into CLIP. For instance, the architecture of the visual encoder in CLIP models plays a significant role in their robustness against 3D corruption. CLIP models tend to exhibit a bias towards shape when making predictions, and this bias tends to diminish after fine-tuning on ImageNet. Vision-language models like LLaVA, which leverage the CLIP vision encoder, can improve classification performance on challenging categories over CLIP alone. Our findings are poised to offer valuable guidance for enhancing the robustness and reliability of CLIP models.

When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning?

Adversarial Supervised Contrastive Learning

Robust Pre-Training by Adversarial Contrastive Learning

Rethinking Robust Contrastive Learning from the Adversarial Perspective

Robustness through Cognitive Dissociation Mitigation in Contrastive Adversarial Training

Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning

Robust Contrastive Learning With Theory Guarantee

Revisiting the Robust Generalization of Adversarial Prompt Tuning

Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning

Adversarial Contrastive Learning via Asymmetric InfoNCE

On the Adversarial Robustness of Graph Contrastive Learning Methods

Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning

Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Is Self-Supervised Learning More Robust Than Supervised Learning?

PointACL: Adversarial Contrastive Learning for Robust Point Clouds Representation under Adversarial Attack

Toward a Holistic Evaluation of Robustness in CLIP Models

Adversarial Training with Contrastive Learning in NLP

A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)

Generalized Supervised Contrastive Learning

Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness

CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning