Win-Win: Training High-Resolution Vision Transformers from Two Windows

Vincent Leroy, Jerome Revaud, Thomas Lucas, Philippe Weinzaepfel

2024-03-22

Abstract: Transformers have become the standard in state-of-the-art vision architectures, achieving impressive performance on both image-level and dense pixelwise tasks. However, training vision transformers for high-resolution pixelwise tasks has a prohibitive cost. Typical solutions boil down to hierarchical architectures, fast and approximate attention, or training on low-resolution crops. This latter solution does not constrain architectural choices, but it leads to a clear performance drop when testing at resolutions significantly higher than that used for training, thus requiring ad-hoc and slow post-processing schemes. In this paper, we propose a novel strategy for efficient training and inference of high-resolution vision transformers. The key principle is to mask out most of the high-resolution inputs during training, keeping only N random windows. This allows the model to learn local interactions between tokens inside each window, and global interactions between tokens from different windows. As a result, the model can directly process the high-resolution input at test time without any special trick. We show that this strategy is effective when using relative positional embedding such as rotary embeddings. It is 4 times faster to train than a full-resolution network, and it is straightforward to use at test time compared to existing approaches. We apply this strategy to three dense prediction tasks with high-resolution data. First, we show on the task of semantic segmentation that a simple setting with 2 windows performs best, hence the name of our method: Win-Win. Second, we confirm this result on the task of monocular depth prediction. Third, we further extend it to the binocular task of optical flow, reaching state-of-the-art performance on the Spring benchmark that contains Full-HD images with an order of magnitude faster inference than the best competitor.
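The masking principle described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the image has already been split into a grid of patch tokens, and it keeps only the tokens that fall inside N randomly placed rectangular windows, together with their (row, column) grid positions. The function names, the window sizes, and the choice to let windows overlap are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def sample_window_token_indices(grid_h, grid_w, win_h, win_w, num_windows=2, rng=None):
    """Return sorted flat indices of tokens covered by random windows.

    Hypothetical helper: windows are placed uniformly at random and may
    overlap, in which case shared tokens are kept only once.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = np.zeros((grid_h, grid_w), dtype=bool)
    for _ in range(num_windows):
        # Top-left corner chosen so the window fits inside the token grid.
        top = rng.integers(0, grid_h - win_h + 1)
        left = rng.integers(0, grid_w - win_w + 1)
        keep[top:top + win_h, left:left + win_w] = True
    return np.flatnonzero(keep)


def keep_window_tokens(tokens, grid_h, grid_w, win_h, win_w, num_windows=2, rng=None):
    """Drop all tokens outside the sampled windows.

    tokens: array of shape (grid_h * grid_w, dim), in row-major grid order.
    Returns the kept tokens and their (row, col) positions in the full grid,
    which a relative positional embedding would need.
    """
    idx = sample_window_token_indices(grid_h, grid_w, win_h, win_w, num_windows, rng)
    rows, cols = np.divmod(idx, grid_w)
    positions = np.stack([rows, cols], axis=1)
    return tokens[idx], positions
```

With `num_windows=2`, attention computed over the kept tokens mixes pairs from the same window (local interactions) and pairs from different windows (global interactions), while the sequence length stays far below that of the full token grid.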

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the cost of training vision transformers for dense prediction tasks at high resolution. Existing solutions, such as hierarchical architectures, fast approximate attention, or training on low-resolution crops, each have limitations. The researchers propose a strategy called "Win-Win" that masks out most of the high-resolution input during training and keeps only N random windows, so that the model learns local interactions within each window and global interactions between windows. The model can then process high-resolution inputs directly at test time without any special technique. The paper shows that this strategy is effective when combined with relative positional embeddings (such as rotary embeddings), with 4x faster training, halved memory usage, and performance similar to or better than full-resolution training on semantic segmentation, monocular depth prediction, and binocular optical flow estimation. In particular, Win-Win achieves state-of-the-art performance on the Spring benchmark for optical flow estimation, with inference an order of magnitude faster than the best competitor.
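Why relative positional embeddings suit this setting can be seen in a small sketch of rotary embeddings. The example below is a simplified 1D version with a hypothetical function name and a commonly used default frequency base (images would need a 2D variant, which is not shown). It checks that the attention score between a rotated query and key depends only on the offset between their positions, so interactions learned inside windows placed anywhere in the image carry over to the full-resolution grid.

```python
import numpy as np


def rotary_embed(x, pos, base=10000.0):
    """Apply a 1D rotary embedding to vector x (even dimension) at position pos.

    Consecutive pairs of channels are rotated by an angle proportional to pos,
    with a different frequency per pair. Illustrative sketch only.
    """
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per channel pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same offset (7 positions apart), very different absolute positions.
score_near_origin = rotary_embed(q, 3) @ rotary_embed(k, 10)
score_far_away = rotary_embed(q, 103) @ rotary_embed(k, 110)
```

The two scores agree up to floating-point error, because rotating the query by one angle and the key by another leaves a dot product that depends only on the difference of the angles.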

ViTAR: Vision Transformer with Any Resolution

Vision Transformers for Dense Prediction

SepViT: Separable Vision Transformer

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

Vision Transformers: From Semantic Segmentation to Dense Prediction

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation

Three things everyone should know about Vision Transformers

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets

Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks

ResFormer: Scaling ViTs with Multi-Resolution Training

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Effective Vision Transformer Training: A Data-Centric Perspective

Beyond Fixation: Dynamic Window Visual Transformer