What Makes for Hierarchical Vision Transformer?

Yuxin Fang, Xinggang Wang, Rui Wu, Wenyu Liu
DOI: https://doi.org/10.1109/tpami.2023.3282019
IF: 23.6
2023-01-01
IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract: Recent studies indicate that hierarchical Vision Transformers (ViTs) with a macro architecture of interleaved non-overlapped window-based self-attention and shifted-window operations can achieve state-of-the-art performance in various visual recognition tasks and challenge the ubiquitous convolutional neural networks (CNNs) that use densely slid kernels. In most recently proposed hierarchical ViTs, self-attention is the de facto standard for spatial information aggregation. In this paper, we question whether self-attention is the only choice for hierarchical ViTs to attain strong performance, and we study the effects of different kinds of cross-window communication methods. To this end, we replace self-attention layers with embarrassingly simple linear mapping layers, and the resulting proof-of-concept architecture, termed TransLinear, can achieve very strong performance in ImageNet-1k image recognition. Moreover, we find that TransLinear is able to leverage ImageNet pre-trained weights and demonstrates competitive transfer learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapped window under different cross-window communication approaches. Our results reveal that the macro architecture, rather than specific aggregation layers or cross-window communication mechanisms, is more responsible for hierarchical ViT's strong performance and is the real challenger to the ubiquitous CNN's dense sliding window paradigm.
Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
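The idea in the abstract of replacing window self-attention with a linear mapping can be illustrated with a minimal NumPy sketch. This is not the authors' TransLinear implementation: the function names, the single token-mixing matrix shared across windows and channels, and the cyclic shift used for cross-window communication are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows.

    Returns an array of shape (num_windows, ws * ws, C).
    """
    H, W, C = x.shape
    assert H % ws == 0 and W % ws == 0, "H and W must be divisible by ws"
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/ws, W/ws, ws, ws, C)
    return x.reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/ws, ws, W/ws, ws, C)
    return x.reshape(H, W, C)

def window_linear_mix(x, weight, bias, ws, shift=0):
    """Aggregate tokens inside each window with a fixed linear map.

    Assumption (hypothetical design): `weight` has shape (N, N) with
    N = ws * ws and is shared across windows and channels. Unlike
    self-attention, the mixing weights do not depend on the input.
    `shift` cyclically rolls the map first, a simple stand-in for
    shifted-window cross-window communication.
    """
    H, W, _ = x.shape
    if shift:
        x = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    windows = window_partition(x, ws)  # (num_windows, N, C)
    mixed = np.einsum("mn,wnc->wmc", weight, windows)
    mixed = mixed + bias[None, :, None]
    out = window_reverse(mixed, ws, H, W)
    if shift:
        out = np.roll(out, shift=(shift, shift), axis=(0, 1))
    return out

# Usage: an 8 x 8 map with 3 channels and 4 x 4 windows.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 3))
N = 4 * 4
W_mix = rng.standard_normal((N, N)) / N
b_mix = np.zeros(N)
plain = window_linear_mix(feat, W_mix, b_mix, ws=4)
shifted = window_linear_mix(feat, W_mix, b_mix, ws=4, shift=2)
```

Alternating the plain and shifted calls mirrors the interleaved macro architecture the abstract credits for the strong performance; only the aggregation layer inside each window has been swapped.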