Abstract: We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
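The abstract's step of assembling tokens into image-like representations can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `reassemble` and `upsample_nearest`, the assumption that the first token is a readout token that is simply dropped, and the use of nearest-neighbour upsampling to change resolution are all illustrative choices.

```python
from typing import List, Tuple

Token = List[float]
Grid = List[List[Token]]


def reassemble(tokens: List[Token], image_hw: Tuple[int, int], patch: int) -> Grid:
    """Arrange patch tokens into a (H // patch) x (W // patch) grid of feature vectors.

    Assumption: tokens[0] is a readout (class) token, which is dropped here.
    """
    height, width = image_hw
    rows, cols = height // patch, width // patch
    patch_tokens = tokens[1:]  # drop the assumed readout token
    if len(patch_tokens) != rows * cols:
        raise ValueError("token count does not match the patch grid")
    # Row-major order: token i belongs to grid cell (i // cols, i % cols).
    return [patch_tokens[r * cols:(r + 1) * cols] for r in range(rows)]


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling by an integer factor (a stand-in for learned resampling)."""
    out: Grid = []
    for row in grid:
        # Repeat every cell horizontally, then repeat the widened row vertically.
        wide = [cell for cell in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


# Hypothetical example: a 32 x 48 image with 16 x 16 patches gives a 2 x 3 grid.
tokens = [[0.0, 0.0]] + [[float(i), float(-i)] for i in range(6)]
coarse = reassemble(tokens, image_hw=(32, 48), patch=16)
fine = upsample_nearest(coarse, factor=2)  # 4 x 6 grid at twice the resolution
```

Grids produced this way at several resolutions are the kind of input a convolutional decoder could progressively fuse into a full-resolution prediction.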

### What problem does this paper attempt to address?

This paper targets key challenges in **dense prediction tasks**. The authors introduce a new architecture, the **dense vision transformer (DPT)**, which replaces the convolutional network traditionally used as the backbone for dense prediction. The main problems it addresses are:

1. **Feature resolution and detail preservation**
   - Convolutional networks typically reduce the resolution of the input progressively through downsampling in order to extract multi-scale features. As a result, feature resolution and fine detail are lost in the deeper layers, and the decoder struggles to recover them.
   - DPT uses a Vision Transformer (ViT) backbone, which avoids explicit downsampling operations and maintains a constant spatial resolution throughout processing. This helps preserve finer-grained features.
2. **Global context awareness**
   - The local receptive field of convolutional networks limits their ability to capture global context, which matters for tasks that require globally consistent output, such as semantic segmentation and monocular depth estimation.
   - A Vision Transformer has a global receptive field at every stage through Multi-Head Self-Attention (MHSA), so it captures global context more effectively.
3. **Performance on dense prediction tasks**
   - The experiments show that DPT substantially outperforms existing convolutional models on several dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, DPT improves relative performance by up to 28% over a state-of-the-art fully-convolutional network.
   - For semantic segmentation, DPT sets a new state of the art on ADE20K with 49.02% mIoU.
4. **Adaptation to smaller datasets**
   - The paper also shows that DPT can be fine-tuned on smaller datasets and still perform strongly. For example, it sets a new state of the art for monocular depth estimation on NYUv2 and KITTI.

### Summary

By introducing the dense vision transformer (DPT), this paper addresses two weaknesses of convolutional networks in dense prediction: loss of feature resolution and limited global context. The result is a substantial improvement on dense prediction tasks, on both large and small datasets.
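The two backbone properties discussed above, a constant number of tokens and a global receptive field, can be illustrated with a toy self-attention layer. This is a deliberately simplified sketch, not the paper's MHSA: it uses a single head and identity query, key, and value projections, whereas a real transformer learns these projections and uses multiple heads. The names `softmax` and `self_attention` are hypothetical helpers.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention with identity projections (a simplification)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Every query is scored against every token, so each output sees the whole input.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out = self_attention(tokens)

# Changing only the last token also changes the output at the first position.
perturbed = tokens[:-1] + [[5.0, 5.0]]
out_perturbed = self_attention(perturbed)
```

The output has one vector per input token, so no resolution is lost, and perturbing a distant token changes the first output, which is the sense in which the receptive field is global after a single layer.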

Vision Transformers for Dense Prediction
