Abstract: Transformers have become the standard in state-of-the-art vision architectures, achieving impressive performance on both image-level and dense pixelwise tasks. However, training vision transformers for high-resolution pixelwise tasks has a prohibitive cost. Typical solutions boil down to hierarchical architectures, fast and approximate attention, or training on low-resolution crops. The latter solution does not constrain architectural choices, but it leads to a clear performance drop when testing at resolutions significantly higher than those used for training, thus requiring ad-hoc and slow post-processing schemes. In this paper, we propose a novel strategy for efficient training and inference of high-resolution vision transformers. The key principle is to mask out most of the high-resolution input during training, keeping only N random windows. This allows the model to learn local interactions between tokens inside each window, and global interactions between tokens from different windows. As a result, the model can directly process the high-resolution input at test time without any special trick. We show that this strategy is effective when using relative positional embeddings such as rotary embeddings. It is 4 times faster to train than a full-resolution network, and it is straightforward to use at test time compared to existing approaches. We apply this strategy to three dense prediction tasks with high-resolution data. First, we show on the task of semantic segmentation that a simple setting with 2 windows performs best, hence the name of our method: Win-Win. Second, we confirm this result on the task of monocular depth prediction. Third, we further extend it to the binocular task of optical flow, reaching state-of-the-art performance on the Spring benchmark, which contains Full-HD images, with inference an order of magnitude faster than the best competitor.
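The window-masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_window_tokens`, the patch size of 16, the window size, and the choice to let windows overlap are all assumptions made for illustration. The sketch only shows how one might select the tokens of N random windows from a high-resolution token grid while keeping their coordinates in the full grid, which is the information a relative positional embedding would need.

```python
import numpy as np

def sample_window_tokens(grid_h, grid_w, win_h, win_w, n_windows=2, rng=None):
    """Keep only the tokens that fall inside n_windows random windows.

    Returns the flat indices of the kept tokens and their (row, col)
    coordinates in the full token grid. Windows may overlap here
    (an assumption of this sketch), so duplicates are merged.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = np.zeros((grid_h, grid_w), dtype=bool)
    for _ in range(n_windows):
        # Top-left corner chosen so the window fits inside the grid.
        top = rng.integers(0, grid_h - win_h + 1)
        left = rng.integers(0, grid_w - win_w + 1)
        keep[top:top + win_h, left:left + win_w] = True
    rows, cols = np.nonzero(keep)
    flat_idx = rows * grid_w + cols
    positions = np.stack([rows, cols], axis=1)
    return flat_idx, positions

# Hypothetical setting: Full-HD image, 16x16 patches, two 16x16-token windows.
grid_h, grid_w = 1080 // 16, 1920 // 16
flat_idx, positions = sample_window_tokens(grid_h, grid_w, 16, 16, n_windows=2)
kept_fraction = len(flat_idx) / (grid_h * grid_w)
```

In this hypothetical setting only a small fraction of the tokens is passed to the transformer during training, while at test time the same model would receive every token of the grid with its coordinates unchanged.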

Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

Boost Supervised Pretraining for Visual Transfer Learning: Implications of Self-Supervised Contrastive Representation Learning

Vision Transformers for Dense Prediction

Self-supervised Vision Transformers for Land-cover Segmentation and Classification

On Efficient Transformer-Based Image Pre-training for Low-Level Vision

Vision Transformers: From Semantic Segmentation to Dense Prediction

Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training

A Closer Look at Self-Supervised Lightweight Vision Transformers

Dense Contrastive Learning for Self-Supervised Visual Pre-Training

Vision transformers for dense prediction: A survey

Long-Short Temporal Contrastive Learning of Video Transformers

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

DenseCL: A Simple Framework for Self-Supervised Dense Visual Pre-Training

Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations

Self-supervised Models are Good Teaching Assistants for Vision Transformers

Patch-level Representation Learning for Self-supervised Vision Transformers

Win-Win: Training High-Resolution Vision Transformers from Two Windows

A Comprehensive Study of Vision Transformers on Dense Prediction Tasks