Vision Transformers for Dense Prediction

René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
DOI: https://doi.org/10.48550/arXiv.2103.13413
2021-03-25
Abstract: We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
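The abstract's step of assembling tokens into image-like representations can be pictured with a minimal sketch. It assumes a row-major patch order, a patch size of 16, and that any class or readout token has already been dropped. The function name `reassemble_tokens` and its parameters are hypothetical and are not taken from the authors' code. The resampling to other resolutions and the convolutional decoder are not shown.

```python
def reassemble_tokens(tokens, image_height, image_width, patch_size=16):
    """Arrange a flat list of patch tokens into a rows x cols grid of feature vectors.

    Assumes tokens are in row-major (raster) patch order and that no
    class/readout token is present in the list.
    """
    rows = image_height // patch_size
    cols = image_width // patch_size
    if len(tokens) != rows * cols:
        raise ValueError(
            f"expected {rows * cols} tokens for a {rows}x{cols} grid, got {len(tokens)}"
        )
    # Slice the flat sequence into one list of tokens per patch row.
    return [tokens[r * cols:(r + 1) * cols] for r in range(rows)]


# Toy input: a 64x96 image with 16x16 patches gives a 4x6 grid of 2-dim tokens.
tokens = [[float(i), float(i) * 0.5] for i in range(24)]
grid = reassemble_tokens(tokens, image_height=64, image_width=96)
print(len(grid), len(grid[0]), len(grid[0][0]))  # 4 6 2
```

Each grid cell holds the feature vector of one patch, so the result can be treated like a low-resolution feature map with one channel per token dimension.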
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets key challenges in **dense prediction tasks**. The authors introduce a new architecture, the **Dense Prediction Transformer (DPT)**, called a dense vision transformer in the abstract, which uses a vision transformer in place of a convolutional network as the backbone for dense prediction. The main problems it addresses are:

1. **Feature Resolution and Detail Preservation**:
   - Convolutional networks usually reduce the resolution of their feature maps step by step through down-sampling in order to extract multi-scale features. Resolution and detail are therefore lost in the deep layers, and the decoder has difficulty recovering them.
   - DPT uses a Vision Transformer (ViT) backbone, which performs no explicit down-sampling after the initial patch embedding and processes representations at a constant, relatively high resolution. This helps preserve finer-grained features.
2. **Global Context Awareness**:
   - The local receptive field of convolutional layers limits their ability to capture global context, which matters for tasks that require globally coherent output, such as semantic segmentation and monocular depth estimation.
   - A Vision Transformer has a global receptive field at every stage through Multi-Head Self-Attention (MHSA), so it captures global context more directly.
3. **Improving the Performance of Dense Prediction Tasks**:
   - The experiments show that DPT substantially outperforms convolutional models on several dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, the relative improvement over a state-of-the-art fully-convolutional network is up to 28%.
   - For semantic segmentation, DPT sets a new state of the art on ADE20K with 49.02% mIoU.
4. **The Ability to Adapt to Small-Scale Datasets**:
   - The paper also shows that DPT can be fine-tuned on smaller datasets and still set a new state of the art, for example on NYUv2 and KITTI for monocular depth estimation and on Pascal Context for semantic segmentation.

### Summary

By introducing DPT, this paper addresses the loss of feature resolution and the limited global context of convolutional backbones in dense prediction tasks. The result is substantially better performance, both when training on large datasets and when fine-tuning on smaller ones.
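The global context point above can be illustrated with a minimal sketch of single-head self-attention weights. It assumes identity query and key projections and a toy set of four 2-dimensional tokens. The function `self_attention_weights` is a hypothetical helper for illustration, not part of the DPT code. It shows only that every token assigns a non-zero weight to every other token within one layer, whereas a 3x3 convolution would see only immediate neighbours.

```python
import math


def self_attention_weights(tokens):
    """Return the softmax attention matrix for single-head self-attention.

    Simplifying assumption: queries and keys are the tokens themselves
    (identity projections), scaled by sqrt(dim) as in standard attention.
    """
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        # Scaled dot product of this query with every key in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        # Numerically stable softmax over all positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
weights = self_attention_weights(tokens)

# Every entry is strictly positive: each token attends to all positions.
print(all(w > 0.0 for row in weights for w in row))  # True
```

In a real ViT the queries, keys, and values come from learned projections and there are several heads, but the all-to-all connectivity shown here is the same.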