A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

Kishaan Jeeveswaran, Senthilkumar Kathiresan, Arnav Varma, Omar Magdy, Bahram Zonooz, Elahe Arani

DOI: https://doi.org/10.48550/arXiv.2201.08683

2022-01-21

Abstract: Convolutional Neural Networks (CNNs), architectures consisting of convolutional layers, have been the standard choice in vision tasks. Recent studies have shown that Vision Transformers (VTs), architectures based on self-attention modules, achieve comparable performance in challenging tasks such as object detection and semantic segmentation. However, the image processing mechanism of VTs is different from that of conventional CNNs. This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks. To address these questions, we study and compare VT and CNN architectures as feature extractors in object detection and semantic segmentation. Our extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks in both tasks, whereas CNNs perform better at higher image resolutions in object detection. Furthermore, our results demonstrate that VTs in dense prediction tasks produce more reliable and less texture-biased predictions.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The main problem this paper addresses is how Vision Transformers (VTs) and Convolutional Neural Networks (CNNs) differ as feature extractors in dense prediction tasks, with a particular focus on generalization, robustness, reliability, and texture bias. Specifically, the paper examines the following aspects:

1. **Generalization ability**:
   - **In-distribution data**: Compare the performance of VTs and CNNs within the training data distribution, in particular accuracy and speed in object detection and semantic segmentation.
   - **Out-of-distribution data**: Evaluate the models on unseen datasets to test how well they generalize.
2. **Robustness**:
   - **Natural corruptions**: Simulate real-world image degradations, such as weather, lighting, and camera noise, and evaluate how robust the models are to them.
   - **Adversarial attacks**: Test the models' resistance to maliciously designed input perturbations, including untargeted and targeted attacks.
3. **Reliability**:
   - **Model calibration**: Evaluate how well the model's prediction confidence matches its actual accuracy, which matters especially in safety-critical applications such as autonomous driving.
4. **Texture bias**:
   - **Texture and shape bias**: Quantify how much the model relies on texture versus shape cues when making predictions, as an indicator of robustness and generalization.

Through the analysis of these aspects, the paper aims to compare the strengths and weaknesses of VTs and CNNs in complex vision tasks and to provide a reference for future research and applications.
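To make the model-calibration aspect concrete, the following is a minimal sketch of one common way to measure the gap between confidence and accuracy: the expected calibration error (ECE). This summary does not state which calibration metric the paper uses, so ECE is an assumption made for illustration. The function name `expected_calibration_error`, its parameters, and the equal-width binning scheme are likewise illustrative choices, not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Sample-weighted average of |accuracy - mean confidence| over confidence bins.

    Illustrative sketch only: the metric and the equal-width binning are
    assumptions, not the evaluation protocol of the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # sum of confidences falling in each bin
    hit_sum = [0.0] * n_bins   # number of correct predictions in each bin

    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if ok else 0.0

    # |acc_b - conf_b| * (count_b / n) simplifies to |hit_sum_b - conf_sum_b| / n,
    # so empty bins contribute zero without a special case.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / n


if __name__ == "__main__":
    # Hypothetical per-prediction confidences and correctness flags.
    confs = [0.95, 0.95, 0.55, 0.55]
    hits = [True, True, True, False]
    print(expected_calibration_error(confs, hits))
```

A value near zero means the stated confidences track the observed accuracy closely, while larger values indicate over- or under-confidence. For dense prediction, the inputs could be per-pixel or per-detection confidences, depending on the task.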

Do Vision Transformers See Like Convolutional Neural Networks?

A Comprehensive Survey of Transformers for Computer Vision

A Comprehensive Study of Vision Transformers in Image Classification Tasks

Vision Transformers for Dense Prediction

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Vision Transformer with Convolutions Architecture Search

DctViT: Discrete Cosine Transform Meet Vision Transformers

Vision Transformers: From Semantic Segmentation to Dense Prediction

KVT: k-NN Attention for Boosting Vision Transformers

A survey of the Vision Transformers and their CNN-Transformer based Variants

Visualization Comparison of Vision Transformers and Convolutional Neural Networks

Locality Guidance for Improving Vision Transformers on Tiny Datasets

Optimizing Vision Transformers with Data-Free Knowledge Transfer

Semantic Segmentation using Vision Transformers: A survey

Vision transformers for dense prediction: A survey

A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis

A Survey on Vision Transformer

Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

CMT: Convolutional Neural Networks Meet Vision Transformers