Do Vision-Language Foundational models show Robust Visual Perception?

Shivam Chandhok, Pranav Tandon
2024-08-13
Abstract: Recent advances in vision-language foundational models have enabled the development of systems that can perform visual understanding and reasoning tasks. However, it is unclear whether these models are robust to distribution shifts, and how their performance and generalization capabilities vary under changes in data distribution. In this project we strive to answer the question "Are vision-language foundational models robust to distribution shifts like human perception?" Specifically, we consider a diverse range of vision-language models and compare how the performance of these systems is affected by corruption-based distribution shifts (such as motion blur, fog, snow, and Gaussian noise) commonly found in practical real-world scenarios. We analyse the generalization capabilities qualitatively and quantitatively on the zero-shot image classification task under the aforementioned distribution shifts. Our code will be available at https://github.com/shivam-chandhok/CPSC-540-Project
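As a rough illustration of the evaluation the abstract describes (corrupt an image at increasing severity, then classify it zero-shot by matching its embedding against class text embeddings), the following is a minimal sketch and not the authors' code. Everything named here is an assumption made for illustration: the noise levels in `NOISE_STD`, the toy 4-dimensional `CLASS_EMBEDDINGS`, and the use of the raw pixel vector as a stand-in for a real image encoder. A real study would use an actual vision-language model and an established corruption benchmark.

```python
import math
import random

# Hypothetical noise standard deviations on a [0, 1] pixel scale, one per
# severity level 1-5. Illustrative values only, not taken from the paper.
NOISE_STD = [0.04, 0.08, 0.12, 0.18, 0.26]


def add_gaussian_noise(pixels, severity, rng):
    """Add zero-mean Gaussian noise to a flat list of pixels and clip to [0, 1]."""
    std = NOISE_STD[severity - 1]
    return [min(1.0, max(0.0, p + rng.gauss(0.0, std))) for p in pixels]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_embedding, class_embeddings):
    """Return the class whose (text) embedding is most similar to the image embedding."""
    return max(
        class_embeddings,
        key=lambda name: cosine_similarity(image_embedding, class_embeddings[name]),
    )


def accuracy_under_noise(images, labels, class_embeddings, severity, rng):
    """Fraction of images still classified correctly after Gaussian-noise corruption."""
    correct = 0
    for pixels, label in zip(images, labels):
        corrupted = add_gaussian_noise(pixels, severity, rng)
        # Stand-in for a real image encoder: the corrupted pixel vector is
        # used directly as the image embedding.
        if zero_shot_predict(corrupted, class_embeddings) == label:
            correct += 1
    return correct / len(images)


# Toy stand-ins for class text embeddings; one clean "image" per class.
CLASS_EMBEDDINGS = {
    "airplane": [0.9, 0.1, 0.1, 0.1],
    "cat": [0.1, 0.9, 0.1, 0.1],
    "ship": [0.1, 0.1, 0.9, 0.1],
}
IMAGES = [list(vec) for vec in CLASS_EMBEDDINGS.values()]
LABELS = list(CLASS_EMBEDDINGS)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    for severity in range(1, 6):
        acc = accuracy_under_noise(IMAGES, LABELS, CLASS_EMBEDDINGS, severity, rng)
        print(f"severity {severity}: accuracy {acc:.2f}")
```

Reporting accuracy per severity level, rather than a single number, is what lets robustness be compared across models: a robust model's accuracy should fall off slowly as severity increases.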
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is **the robustness of Vision-Language Foundational Models under distribution shifts**. Specifically, the researchers are concerned with the performance and generalization ability of these models when facing common real-world image degradations (such as motion blur, fog, snow, and Gaussian noise).

### Problem Background

In recent years, the development of Vision-Language Foundational Models has enabled systems to perform visual understanding and reasoning tasks, with performance close to the level of human visual perception. However, the performance of these models in practical applications may be affected by changes in the data distribution. Traditional supervised learning models assume that test data follows the same distribution as training data, but in the real world this assumption often does not hold. As a result, when these models encounter data from a different distribution, their performance may drop significantly.

### Research Objectives

This paper aims to answer the following question: "**Are Vision-Language Foundational Models as robust to distribution shifts as human perception?**" To this end, the researchers selected a variety of Vision-Language Foundational Models and analyzed their performance under different types of image degradation. Through quantitative and qualitative analysis, they aim to understand the generalization ability and robustness of these models on zero-shot image classification tasks.

### Specific Research Contents

1. **Model Selection**: The researchers selected a variety of Vision-Language Foundational Models, including multi-encoder models based on contrastive learning (such as CLIP), encoder-decoder models based on generation tasks, and hybrid models (such as CoCa and BLIP2).
2. **Dataset and Evaluation Protocol**: The CIFAR10 and PASCAL VOC datasets were used to evaluate the zero-shot classification performance of the models under different severities of image degradation.
3. **Experimental Design**: By introducing common image degradations (such as motion blur, fog, snow, and Gaussian noise), the researchers analyzed the impact of these degradations on model performance. The experimental results show that hybrid models (such as CoCa and BLIP2) exhibit better robustness and generalization ability when dealing with distribution shifts.

### Main Findings

- **Hybrid Models Are More Robust**: Hybrid models (such as CoCa and BLIP2), which combine the objective functions of contrastive learning and generation tasks, can maintain good performance and semantic representation under different image degradations.
- **Vision Transformers Are Superior to Convolutional Networks**: Models based on vision transformers (such as CLIP ViT) are significantly superior to models based on convolutional networks (such as CLIP ResNet) in terms of robustness and generalization ability.
- **Gaussian Noise Has the Greatest Impact**: Among all types of image degradation, Gaussian noise has the greatest impact on model performance, while the impact of snow and fog is relatively small.

### Conclusion

The research in this paper shows that hybrid models (such as CoCa and BLIP2) exhibit stronger robustness and generalization ability when dealing with distribution shifts. It therefore recommends giving priority to hybrid models based on vision transformers in practical applications (such as autonomous driving and robotics) to obtain better performance and robustness.

### References

- Radford, A., et al. (2021). Learning transferable visual models from natural language supervision.
- Hendrycks, D., & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations.
- Li, J., et al. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.
- Yu, J., et al. (2022). Coca: Cont