A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

Kishaan Jeeveswaran, Senthilkumar Kathiresan, Arnav Varma, Omar Magdy, Bahram Zonooz, Elahe Arani
DOI: https://doi.org/10.48550/arXiv.2201.08683
2022-01-21
Abstract: Convolutional Neural Networks (CNNs), architectures consisting of convolutional layers, have been the standard choice in vision tasks. Recent studies have shown that Vision Transformers (VTs), architectures based on self-attention modules, achieve comparable performance in challenging tasks such as object detection and semantic segmentation. However, the image processing mechanism of VTs is different from that of conventional CNNs. This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks. To address these questions, we study and compare VT and CNN architectures as feature extractors in object detection and semantic segmentation. Our extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks in both tasks, whereas CNNs perform better at higher image resolutions in object detection. Furthermore, our results demonstrate that VTs in dense prediction tasks produce more reliable and less texture-biased predictions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper evaluates how Vision Transformers (VTs) and Convolutional Neural Networks (CNNs) differ as feature extractors in dense prediction tasks, focusing on generalization ability, robustness, reliability, and texture bias. Specifically, it examines the following aspects:

1. **Generalization ability**:
   - **In-distribution data**: Compares VTs and CNNs on data drawn from the training distribution, in particular their accuracy and speed in object detection and semantic segmentation.
   - **Out-of-distribution data**: Evaluates the models on unseen datasets to test how well they generalize.
2. **Robustness**:
   - **Natural corruptions**: Simulates real-world corruptions such as weather, lighting, and camera noise, and evaluates how robust the models are to them.
   - **Adversarial attacks**: Tests the models' resistance to maliciously designed input perturbations, including untargeted and targeted attacks.
3. **Reliability**:
   - **Model calibration**: Evaluates how well the model's prediction confidence matches its actual accuracy, which matters most in safety-critical applications such as autonomous driving.
4. **Texture bias**:
   - **Texture and shape bias**: Quantifies how much the model relies on texture versus shape cues when making predictions, as a further indicator of its robustness and generalization ability.

By analyzing these aspects together, the paper aims to compare the advantages and disadvantages of VTs and CNNs in complex visual tasks and to provide a reference for future research and applications.
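
To make the model calibration aspect concrete, the following is a minimal sketch of the expected calibration error (ECE), a common way to measure the gap between confidence and accuracy. The summary above does not state which calibration metric the paper uses, so the choice of ECE, the function name `expected_calibration_error`, and the default of 10 equal-width bins are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Illustrative ECE: bin-weighted mean of |accuracy - mean confidence|.

    Hypothetical helper, not taken from the paper. Predictions are grouped
    into equal-width confidence bins [k/n_bins, (k+1)/n_bins); a confidence
    of exactly 1.0 is placed in the last bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hits = [0] * n_bins        # number of correct predictions per bin
    count = [0] * n_bins       # number of predictions per bin

    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hits[idx] += 1 if is_correct else 0
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hits[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / total) * abs(accuracy - mean_conf)
    return ece
```

A well-calibrated model yields a value close to 0, while an overconfident one (high confidence, lower accuracy) yields a larger value. For dense prediction, the inputs would typically be per-detection or per-pixel confidences and their correctness flags; how these are collected depends on the task and is not specified here.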