Abstract: Recent advances in vision-language foundational models have enabled the development of systems that can perform visual understanding and reasoning tasks. However, it is unclear whether these models are robust to distribution shifts, and how their performance and generalization capabilities vary under changes in data distribution. In this project we strive to answer the question "Are vision-language foundational models robust to distribution shifts like human perception?" Specifically, we consider a diverse range of vision-language models and compare how the performance of these systems is affected by corruption-based distribution shifts (such as motion blur, fog, snow, and Gaussian noise) commonly found in practical real-world scenarios. We analyse the generalization capabilities qualitatively and quantitatively on the zero-shot image classification task under the aforementioned distribution shifts. Our code will be available at https://github.com/shivam-chandhok/CPSC-540-Project
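
The evaluation the abstract describes (corrupt the images at increasing severity, then measure zero-shot classification accuracy) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `SIGMAS` noise levels, the function names, and the `encode_image` callable (a stand-in for a vision-language model's image encoder) are all assumptions made for this example, and images are represented as flat lists of pixel values in [0, 1] to keep the sketch dependency-free.

```python
import math
import random

# Hypothetical noise standard deviations for severity levels 1-5.
# These values are illustrative only, not the settings used in the paper.
SIGMAS = (0.05, 0.10, 0.15, 0.20, 0.30)


def add_gaussian_noise(pixels, severity, rng):
    """Add zero-mean Gaussian noise to pixel values in [0, 1], then clip."""
    sigma = SIGMAS[severity - 1]
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_predict(image_emb, text_embs):
    """Return the index of the class prompt most similar to the image embedding."""
    scores = [cosine(image_emb, t) for t in text_embs]
    return scores.index(max(scores))


def accuracy_by_severity(images, labels, encode_image, text_embs, seed=0):
    """Zero-shot accuracy on Gaussian-noise-corrupted images, per severity level.

    encode_image is an assumed callable mapping pixels to an embedding;
    text_embs holds one (assumed precomputed) embedding per class prompt.
    """
    rng = random.Random(seed)
    results = {}
    for severity in range(1, len(SIGMAS) + 1):
        correct = 0
        for pixels, label in zip(images, labels):
            corrupted = add_gaussian_noise(pixels, severity, rng)
            correct += zero_shot_predict(encode_image(corrupted), text_embs) == label
        results[severity] = correct / len(labels)
    return results
```

In an actual experiment, `encode_image` and `text_embs` would come from the model under test (for example, a contrastive model's image and text encoders), and the same loop would be repeated for each corruption type.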

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **the robustness of Vision-Language Foundational Models under distribution shifts**. Specifically, researchers are concerned about the performance and generalization ability of these models when facing common image degradations in the real world (such as motion blur, fog, snow, and Gaussian noise).

### Problem Background

In recent years, the development of Vision-Language Foundational Models has enabled systems to perform visual understanding and reasoning tasks, and their performance is close to the human visual perception level. However, the performance of these models in practical applications may be affected by changes in the data distribution. Traditional supervised learning models assume that the distribution of test data is the same as that of training data, but in the real world this assumption often does not hold. Therefore, when these models encounter data with different distributions, their performance may drop significantly.

### Research Objectives

This paper aims to answer the following question: "**Are Vision-Language Foundational Models as robust to distribution shifts as human perception?**" To this end, researchers selected a variety of Vision-Language Foundational Models and analyzed their performance under different types of image degradations. Through quantitative and qualitative analysis, researchers hope to understand the generalization ability and robustness of these models in zero-shot image classification tasks.

### Specific Research Contents

1. **Model Selection**: Researchers selected a variety of Vision-Language Foundational Models, including multi-encoder models based on contrastive learning (such as CLIP), encoder-decoder models based on generation tasks, and hybrid models (such as CoCa and BLIP2).
2. **Dataset and Evaluation Protocol**: The CIFAR10 and PASCAL VOC datasets were used for experiments to evaluate the zero-shot classification performance of models under different severities of image degradations.
3. **Experimental Design**: By introducing common image degradations (such as motion blur, fog, snow, and Gaussian noise), researchers analyzed the impact of these degradations on model performance. The experimental results show that hybrid models (such as CoCa and BLIP2) exhibit better robustness and generalization ability when dealing with distribution shifts.

### Main Findings

- **Hybrid Models Are More Robust**: Hybrid models (such as CoCa and BLIP2), due to the combination of objective functions of contrastive learning and generation tasks, can maintain good performance and semantic representation under different image degradations.
- **Vision Transformers Are Superior to Convolutional Networks**: Models based on vision transformers (such as CLIP ViT) are significantly superior to models based on convolutional networks (such as CLIP ResNet) in terms of robustness and generalization ability.
- **Gaussian Noise Has the Greatest Impact**: Among all types of image degradations, Gaussian noise has the greatest impact on model performance, while the impact of snow and fog is relatively small.

### Conclusion

The research in this paper shows that hybrid models (such as CoCa and BLIP2) exhibit stronger robustness and generalization ability when dealing with distribution shifts. Therefore, it is recommended to give priority to using hybrid models based on vision transformers in practical applications (such as in the fields of autonomous driving and robotics) to obtain better performance and robustness.

### References

- Radford, A., et al. (2021). Learning transferable visual models from natural language supervision.
- Hendrycks, D., & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations.
- Li, J., et al. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.
- Yu, J., et al. (2022). Coca: Cont

Do Vision-Language Foundational models show Robust Visual Perception?

Robust Computer Vision in an Ever-Changing World: A Survey of Techniques for Tackling Distribution Shifts

Open-Vocabulary Object Detectors: Robustness Challenges under Distribution Shifts

Robustness Analysis of Video-Language Models Against Visual and Language Perturbations

Can Self-Supervised Representation Learning Methods Withstand Distribution Shifts and Corruptions?

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?

Foundational Models Defining a New Era in Vision: A Survey and Outlook

How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation

Robust Fine-Tuning of Vision-Language Models for Domain Generalization

ViewFool: Evaluating the Robustness of Visual Recognition to Adversarial Viewpoints

Analyzing the Roles of Language and Vision in Learning from Limited Data

A Survey on the Robustness of Computer Vision Models against Common Corruptions

ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models

Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts

Computer Vision Datasets and Models Exhibit Cultural and Linguistic Diversity in Perception

Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

What's "up" with vision-language models? Investigating their struggle with spatial reasoning

Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts

Can 3D Vision-Language Models Truly Understand Natural Language?