HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs

Ting Yao, Yehao Li, Yingwei Pan, Tao Mei
2024-03-19
Abstract: The hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably incurs a heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), which upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built upon the seminal idea of decomposing the typical CNN operations into two parallel CNN branches in a cost-efficient manner. One high-resolution branch directly takes the primary high-resolution features as inputs, but uses fewer convolution operations. The other low-resolution branch first performs down-sampling and then applies more convolution operations over the resulting low-resolution features. Experiments on both the recognition task (ImageNet-1K dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the superiority of HIRI-ViT. More remarkably, under comparable computational cost ($\sim$5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224$\times$224 inputs.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses the sharp rise in computational cost that vision backbones (especially hybrid models combining Vision Transformer and Convolutional Neural Network) incur when the input resolution increases. Directly enlarging the input resolution enhances model capacity, but it also increases the computational cost quadratically, which is a major obstacle in practical applications. For example, when the input resolution is increased from 224×224 to 384×384, the Top-1 accuracy of Swin Transformer improves, but its computational cost increases significantly. To solve this problem, the authors propose a new hybrid backbone network, HIRI-ViT (HIgh-Resolution Inputs Vision Transformer). It processes high-resolution inputs efficiently by upgrading the conventional four-stage ViT to a five-stage ViT and using dual-branch building blocks in the early stages. Each dual-branch building block contains a high-resolution branch and a low-resolution branch. The former takes high-resolution features directly as input but uses fewer convolution operations; the latter first performs down-sampling and then applies more convolution operations to the low-resolution features. This design retains the gain in model capacity brought by high-resolution inputs while keeping each branch lightweight, which significantly reduces the computational cost. The experimental results show that, at a comparable computational cost (about 5.0 GFLOPs), HIRI-ViT reaches a Top-1 accuracy of 84.3% on the ImageNet dataset with 448×448 inputs, 0.9% higher than the 83.4% of iFormer-S with 224×224 inputs. In addition, HIRI-ViT performs well on dense prediction tasks such as object detection and semantic segmentation.
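To make the quadratic growth concrete, the following minimal sketch counts the multiply-accumulate operations of a single convolution layer at different input sizes. The function names, the 3×3 kernel, and the channel counts are illustrative assumptions, not values taken from the paper; the sketch only covers convolutions, whose cost is proportional to the number of spatial positions.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulate count of one kernel x kernel convolution.

    Assumes stride 1 and 'same' padding, so the output keeps the input's
    spatial size; bias terms are ignored.
    """
    return height * width * kernel * kernel * c_in * c_out


def resolution_cost_ratio(side_from, side_to):
    """Relative convolution cost when the input side length changes."""
    return (side_to / side_from) ** 2


# Hypothetical layer: 3 input channels, 32 output channels.
macs_224 = conv_macs(224, 224, 3, 32)
macs_448 = conv_macs(448, 448, 3, 32)

ratio_384 = resolution_cost_ratio(224, 384)  # (12/7)^2, roughly 2.94
ratio_448 = resolution_cost_ratio(224, 448)  # exactly 4.0

print(f"224 -> 384: cost x{ratio_384:.2f}")
print(f"224 -> 448: cost x{ratio_448:.2f}")
```

Under these assumptions, doubling the side length from 224 to 448 quadruples the cost of every convolution applied at full resolution, which is the overhead the five-stage design is meant to contain.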
In conclusion, the main contribution of this paper is a method for scaling the CNN + ViT hybrid backbone to high-resolution inputs while keeping the computational cost under control. It does so by decomposing the typical CNN operations into two parallel lightweight CNN branches, one operating at high resolution and one at low resolution.
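The saving from the two-branch decomposition can be illustrated with a rough cost comparison. This is a sketch under stated assumptions rather than the paper's actual block: the feature-map size, channel width, and number of convolutions per branch are hypothetical, the down-sampling factor is assumed to be 2, and the cost of the down-sampling and branch-merging steps is ignored.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulate count of one stride-1, same-padded convolution."""
    return height * width * kernel * kernel * c_in * c_out


def single_branch_macs(height, width, channels, num_convs):
    """All convolutions run on the full-resolution feature map."""
    return num_convs * conv_macs(height, width, channels, channels)


def dual_branch_macs(height, width, channels, hi_convs, lo_convs, down=2):
    """High-resolution branch with few convolutions, plus a down-sampled
    branch with more convolutions. Down-sampling and merging costs are
    ignored (assumption)."""
    hi = hi_convs * conv_macs(height, width, channels, channels)
    lo = lo_convs * conv_macs(height // down, width // down, channels, channels)
    return hi + lo


# Hypothetical early-stage feature map: 224 x 224 with 32 channels.
single = single_branch_macs(224, 224, 32, num_convs=3)
dual = dual_branch_macs(224, 224, 32, hi_convs=1, lo_convs=3)

print(f"single branch: {single / 1e6:.1f} MMACs")
print(f"dual branch:   {dual / 1e6:.1f} MMACs")
print(f"dual / single: {dual / single:.3f}")
```

With these illustrative settings, the low-resolution branch runs three convolutions at a quarter of the per-layer cost, so the dual-branch layout needs 7/12 of the operations of three full-resolution convolutions while still keeping one path at full resolution.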