Abstract: The hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably incurs a heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), which upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built upon the seminal idea of decomposing the typical CNN operations into two parallel CNN branches in a cost-efficient manner. One high-resolution branch directly takes primary high-resolution features as inputs, but uses fewer convolution operations. The other low-resolution branch first performs down-sampling and then applies more convolution operations over the resulting low-resolution features. Experiments on both the recognition task (ImageNet-1K dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the superiority of HIRI-ViT. More remarkably, under comparable computational cost ($\sim$5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224$\times$224 inputs.
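
The abstract's point about cost growing quadratically with resolution can be checked with a little arithmetic. The sketch below is illustrative only: the helper names and the patch size of 16 are assumptions and are not taken from the paper. It shows that the number of pixels (and hence the cost of convolution-like, per-pixel operations) grows with the square of the side length, and that a global self-attention layer, whose cost grows with the square of the token count, would grow faster still.

```python
def pixel_cost_ratio(base_side: int, new_side: int) -> float:
    """Growth factor of per-pixel cost when a square input is enlarged.

    Per-pixel operations (e.g. convolutions) scale with the pixel count,
    i.e. with the square of the side length.
    """
    return (new_side / base_side) ** 2


def token_count(side: int, patch: int = 16) -> int:
    """Number of non-overlapping patches in a square image.

    The patch size of 16 is an assumption made for illustration.
    """
    return (side // patch) ** 2


def global_attention_ratio(base_side: int, new_side: int, patch: int = 16) -> float:
    """Growth factor of pairwise token interactions in global self-attention."""
    return (token_count(new_side, patch) / token_count(base_side, patch)) ** 2


if __name__ == "__main__":
    print(round(pixel_cost_ratio(224, 384), 2))   # 2.94
    print(pixel_cost_ratio(224, 448))             # 4.0
    print(token_count(224), token_count(448))     # 196 784
    print(global_attention_ratio(224, 448))       # 16.0
```

Under these assumptions, doubling the side length from 224 to 448 quadruples the pixel count, which is why a model that reaches a 224-level budget at 448×448 needs a cheaper early-stage design.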

What problem does this paper attempt to address?

This paper addresses the sharp rise in computational cost that vision models, especially hybrid models combining Vision Transformer and Convolutional Neural Network components, incur when the input resolution increases. Directly enlarging the input resolution enhances model capacity, but it also increases the computational cost quadratically, which is a major obstacle in practical applications. For example, when the input resolution is increased from 224×224 to 384×384, the Top-1 accuracy of Swin Transformer improves, but its computational cost increases significantly. To solve this problem, the authors propose a new hybrid backbone network, HIRI-ViT (HIgh-Resolution Inputs Vision Transformer). It processes high-resolution inputs efficiently by upgrading the traditional four-stage ViT to a five-stage ViT and by using dual-branch building blocks in the early stages. Each dual-branch building block contains a high-resolution branch and a low-resolution branch. The former directly receives high-resolution features as input but uses fewer convolution operations; the latter first performs down-sampling and then applies more convolution operations to the low-resolution features. This design retains the gain in model capacity brought by high-resolution inputs while keeping each branch lightweight, which significantly reduces the computational cost. The experimental results show that, at a similar computational cost (about 5.0 GFLOPs), HIRI-ViT reaches a Top-1 accuracy of 84.3% on the ImageNet dataset, which is better than other existing models. HIRI-ViT also performs well in dense prediction tasks such as object detection and semantic segmentation.
In conclusion, the main contribution of this paper is a method for scaling the CNN + ViT hybrid backbone network to high-resolution inputs while keeping the computational cost under control. High-resolution inputs are processed efficiently by decomposing the traditional CNN operations into two parallel lightweight CNN branches.
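
To make the cost argument behind the two-branch decomposition concrete, here is a minimal sketch that counts multiply-accumulate operations (MACs) for a plain stack of convolutions versus a two-branch layout. The function names, the feature-map size, the channel width, the number of convolutions per branch, and the down-sampling stride of 2 are all hypothetical choices for illustration; they are not the paper's actual block configuration, and the cost of down-sampling, up-sampling, and fusing the branches is ignored.

```python
def conv_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs of one stride-1, same-padded k x k convolution on an h x w map."""
    return h * w * c_in * c_out * k * k


def single_branch_macs(h: int, w: int, c: int, k: int, n_convs: int) -> int:
    """All convolutions run on the full-resolution feature map."""
    return n_convs * conv_macs(h, w, c, c, k)


def dual_branch_macs(h: int, w: int, c: int, k: int,
                     n_high: int, n_low: int, stride: int = 2) -> int:
    """A few convolutions at full resolution, more after down-sampling.

    Hypothetical layout: n_high convolutions on the h x w map and n_low
    convolutions on the (h / stride) x (w / stride) map.
    """
    high = n_high * conv_macs(h, w, c, c, k)
    low = n_low * conv_macs(h // stride, w // stride, c, c, k)
    return high + low


if __name__ == "__main__":
    # Assumed example: 112 x 112 feature map, 64 channels, 3 x 3 kernels.
    single = single_branch_macs(112, 112, 64, 3, n_convs=4)
    dual = dual_branch_macs(112, 112, 64, 3, n_high=1, n_low=4)
    print(single, dual)
    print(dual / single)  # 0.5
```

In this toy setting, each convolution on the down-sampled map costs a quarter of a full-resolution one, so one high-resolution convolution plus four low-resolution convolutions cost half as much as four high-resolution convolutions.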

HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

ViTAR: Vision Transformer with Any Resolution

FasterViT: Fast Vision Transformers with Hierarchical Attention

EViTIB: Efficient Vision Transformer Via Inductive Bias Exploration for Image Super-Resolution

Scaling Vision Transformers

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

DctViT: Discrete Cosine Transform Meet Vision Transformers

HSViT: Horizontally Scalable Vision Transformer

Improving Vision Transformers by Revisiting High-Frequency Components

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

HydraViT: Stacking Heads for a Scalable ViT

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

DeepViT: Towards Deeper Vision Transformer

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Retina Vision Transformer (RetinaViT): Introducing Scaled Patches into Vision Transformers

ACC-ViT : Atrous Convolution's Comeback in Vision Transformers

Auto-scaling Vision Transformers without Training

ResFormer: Scaling ViTs with Multi-Resolution Training