Abstract: Multi-task dense scene understanding is a thriving research domain that requires simultaneous perception and reasoning on a series of correlated tasks with pixel-wise prediction. Most existing works are severely limited to modeling in the locality because they rely heavily on convolution operations, while learning interactions and inference in a global spatial-position and multi-task context is critical for this problem. In this paper, we propose a novel end-to-end Inverted Pyramid multi-task Transformer (InvPT) to perform simultaneous modeling of spatial positions and multiple tasks in a unified framework. To the best of our knowledge, this is the first work that explores designing a transformer structure for multi-task dense prediction for scene understanding. Besides, it is widely demonstrated that a higher spatial resolution is remarkably beneficial for dense predictions, while it is very challenging for existing transformers to go deeper with higher resolutions due to the huge complexity incurred by large spatial sizes. InvPT presents an efficient UP-Transformer block to learn multi-task feature interaction at gradually increased resolutions, which also incorporates effective self-attention message passing and multi-scale feature aggregation to produce task-specific predictions at a high resolution. Our method achieves superior multi-task performance on the NYUD-v2 and PASCAL-Context datasets, and significantly outperforms previous state-of-the-art methods. The code is available at https://github.com/prismformore/InvPT
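
The resolution bottleneck mentioned in the abstract can be illustrated with simple arithmetic. The sketch below is not taken from InvPT; it assumes plain global self-attention over an H x W grid of tokens, where each attention map has one entry per token pair, and the function names are hypothetical.

```python
def attention_matrix_entries(height: int, width: int) -> int:
    """Entries in one full self-attention map over a height x width token grid.

    Assumption: plain global self-attention, where every token attends to
    every other token, so the map has (H * W) ** 2 entries.
    """
    tokens = height * width
    return tokens * tokens


def growth_when_upsampled(height: int, width: int, scale: int = 2) -> float:
    """Factor by which the attention map grows when each side is scaled up."""
    base = attention_matrix_entries(height, width)
    upsampled = attention_matrix_entries(height * scale, width * scale)
    return upsampled / base


if __name__ == "__main__":
    # Doubling both spatial sides quadruples the token count and multiplies
    # the attention map size by 16, independent of the starting grid.
    print(attention_matrix_entries(14, 14))
    print(growth_when_upsampled(14, 14, scale=2))
```

Under this assumption, every doubling of the spatial resolution multiplies the attention map size by 16, which is one way to see why attending at gradually increased resolutions calls for an efficient block design.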

PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture

PVT v2: Improved baselines with Pyramid Vision Transformer

Nested-TNT: Hierarchical Vision Transformers with Multi-Scale Feature Processing

P2T: Pyramid Pooling Transformer for Scene Understanding

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration

InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding

QuadTree Attention for Vision Transformers

PT-Net: Pyramid Transformer Network for Feature Matching Learning

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding

Demystify Transformers & Convolutions in Modern Image Deep Networks

Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions

A pyramid transformer with cross-shaped windows for low-light image enhancement

Pyramid Transformer for Traffic Sign Detection

CMT: Convolutional Neural Networks Meet Vision Transformers

TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation

TRT-ViT: TensorRT-oriented Vision Transformer

Cross Pyramid Transformer makes U-net stronger in medical image segmentation

Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention