Abstract: What distinguishes robust models from non-robust ones? While for ImageNet distribution shifts it has been shown that such differences in robustness can be traced back predominantly to differences in training data, so far it is not known what that translates to in terms of what the model has learned. In this work, we bridge this gap by probing the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones (ResNets and ViTs) and pretraining sets (OpenAI, LAION-400M, LAION-2B, YFCC15M, CC12M and DataComp), and comparing them to the representation spaces of less robust models with identical backbones, but different (pre)training sets or objectives (CLIP pretraining on ImageNet-Captions, and supervised training or finetuning on ImageNet). Through this analysis, we generate three novel insights. Firstly, we detect the presence of outlier features in robust zero-shot CLIP vision encoders, which to the best of our knowledge is the first time these are observed in non-language and non-transformer models. Secondly, we find the existence of outlier features to be an indication of ImageNet shift robustness in models, since we only find them in robust models in our analysis. Lastly, we also investigate the number of unique encoded concepts in the representation space and find zero-shot CLIP models to encode a higher number of unique concepts in their representation space. However, we do not find this to be an indicator of ImageNet shift robustness and hypothesize that it is rather related to the language supervision. Since the presence of outlier features can be detected without access to any data from shifted datasets, we believe that they could be a useful tool for practitioners to get a feeling for the distribution shift robustness of a pretrained model during deployment.
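
The abstract notes that outlier features can be detected without any shifted data. The sketch below illustrates one way this could look in practice. It assumes the encoder's activations are already available as an array of shape `(n_samples, n_features)`. The non-centered kurtosis score, the `ratio` threshold, and both function names are illustrative assumptions, not necessarily the metric or procedure the authors use.

```python
import numpy as np


def activation_kurtosis(features: np.ndarray) -> float:
    """Mean over samples of the non-centered kurtosis across feature dimensions.

    Roughly 3 for Gaussian-like activations; much larger when a few
    features dominate the others (assumed heuristic, not the paper's code).
    """
    second = np.mean(features ** 2, axis=1)
    fourth = np.mean(features ** 4, axis=1)
    return float(np.mean(fourth / second ** 2))


def outlier_feature_indices(features: np.ndarray, ratio: float = 6.0) -> np.ndarray:
    """Indices of features whose mean |activation| exceeds `ratio` times the median.

    The threshold `ratio` is a hypothetical choice for illustration.
    """
    mean_abs = np.mean(np.abs(features), axis=0)
    typical = np.median(mean_abs)
    return np.flatnonzero(mean_abs > ratio * typical)


# Synthetic stand-in for encoder activations (no real model involved).
rng = np.random.default_rng(0)
plain = rng.normal(size=(512, 256))
with_outlier = plain.copy()
with_outlier[:, 7] += 30.0  # inject one strongly activated feature

print("kurtosis without outlier:", activation_kurtosis(plain))
print("kurtosis with outlier:   ", activation_kurtosis(with_outlier))
print("flagged features:        ", outlier_feature_indices(with_outlier))
```

On the synthetic data, the kurtosis stays near the Gaussian value for `plain` and rises sharply once a single feature is inflated. Only in-distribution activations are needed to compute it.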

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to understand what makes CLIP models robust to ImageNet distribution shifts. Specifically, the authors attempt to identify the differences between robust and non-robust models by analyzing the representation spaces of CLIP models with different pre-training datasets and architectures. The main objectives of the paper include:

1. **Detecting Outlier Features in Robust Models**: The authors found that robust zero-shot CLIP vision encoders exhibit outlier features, whose activation values are significantly higher than the average values of other features in the same layer. This is the first time such features have been observed in non-language and non-Transformer models.
2. **Outlier Features as Indicators of Robustness**: Through comparative analysis, the authors found that the presence of outlier features can serve as an indicator of a model's robustness to ImageNet distribution shifts, as these features are only observed in robust models.
3. **Number of Unique Concepts**: The authors also studied the number of unique concepts in the representation space and found that zero-shot CLIP models encode more of them. However, this is not a direct indicator of robustness and may instead be related to language supervision.

### Background and Methods

- **CLIP Model**: CLIP is a large-scale pre-trained multimodal model that aligns images and text through contrastive pre-training on image-text pairs, without class labels, and therefore performs well in zero-shot image classification tasks.
- **Robustness Measurement**: The authors used two methods to measure the robustness of the models: the ratio of original performance to post-shift performance, and Effective Robustness (ER).
- **Model Pool**: The authors analyzed CLIP models with various architectures (ResNet50, ResNet101, ViT-B-16, ViT-B-32, ViT-L-14) and pre-training datasets (OpenAI, LAION-400M, LAION-2B, YFCC15M, CC12M, DataComp), as well as models fine-tuned or trained with supervision on ImageNet.

### Main Findings

1. **Outlier Features in Robust Models**:
   - Robust zero-shot CLIP models exhibit outlier features with activation values significantly higher than other features.
   - These outlier features propagate through the weight matrix of downstream classifiers, forming so-called "privileged directions" that are crucial for model predictions.
2. **Outlier Features as Indicators of Robustness**:
   - Comparative analysis revealed that outlier features are present only in robust models and absent in non-robust models.
   - This indicates that the presence of outlier features can serve as an indicator of a model's robustness to ImageNet distribution shifts.
3. **Number of Unique Concepts**:
   - Robust zero-shot CLIP models encode more unique concepts in their representation space.
   - However, this is not a direct indicator of robustness, as some non-robust CLIP models pre-trained on ImageNet-Captions also encode a large number of unique concepts.
   - The authors speculate that the increase in the number of unique concepts may be related to language supervision.

### Conclusion

- **Features of Robust Models**: Robust zero-shot CLIP models exhibit outlier features, and the presence of these features can serve as an indicator of a model's robustness to ImageNet distribution shifts.
- **Number of Unique Concepts**: Although robust models encode more unique concepts, this is not a direct indicator of robustness and may instead be related to language supervision.
- **Roots of Robustness**: Robust models are typically pre-trained on larger and more diverse datasets, which may be the common root of both robustness and outlier features.

Through these findings, the authors provide new perspectives for understanding the robustness of CLIP models and offer potential methods for diagnosing and improving model robustness.
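
The two robustness measures mentioned in the summary can be sketched as follows. This assumes the commonly used form of Effective Robustness, in which shifted accuracy is compared against a logit-linear baseline fitted on reference models. Whether the paper uses exactly this form, the direction of the ratio, all function names, and the accuracy values below are assumptions made for illustration.

```python
import numpy as np


def logit(p):
    """Log-odds of an accuracy in (0, 1), clipped for numerical safety."""
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_baseline(id_acc, shift_acc):
    """Fit a line in logit space mapping in-distribution to shifted accuracy."""
    slope, intercept = np.polyfit(logit(id_acc), logit(shift_acc), deg=1)
    return float(slope), float(intercept)


def effective_robustness(id_acc, shift_acc, slope, intercept):
    """Shifted accuracy minus the accuracy the baseline trend predicts."""
    expected = sigmoid(slope * logit(id_acc) + intercept)
    return float(shift_acc - expected)


def accuracy_ratio(id_acc, shift_acc):
    """Fraction of in-distribution accuracy retained under shift (assumed direction)."""
    return shift_acc / id_acc


# Hypothetical accuracies of reference models (not results from the paper).
ref_id = np.array([0.60, 0.70, 0.80])
ref_shift = sigmoid(1.0 * logit(ref_id) - 1.0)
slope, intercept = fit_baseline(ref_id, ref_shift)

# A hypothetical model that sits above the baseline trend.
print("ER:   ", effective_robustness(0.70, 0.60, slope, intercept))
print("ratio:", accuracy_ratio(0.70, 0.60))
```

A positive ER means the model retains more accuracy under shift than its in-distribution accuracy alone would predict. The plain ratio does not account for that baseline trend.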

Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts

Toward a Holistic Evaluation of Robustness in CLIP Models

A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)

Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study

Improving CLIP Robustness with Knowledge Distillation and Self-Training

Do CLIPs Always Generalize Better than ImageNet Models?

Effective Robustness against Natural Distribution Shifts for Models with Different Training Data

Robust Fine-Tuning of Vision-Language Models for Domain Generalization

What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

Delving into the Openness of CLIP

Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances

Models Out of Line: A Fourier Lens on Distribution Shift Robustness

Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization

Benchmarking Low-Shot Robustness to Natural Distribution Shifts

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP

Exploring the Adversarial Robustness of CLIP for AI-generated Image Detection

Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights

Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP