A Novel Shortcut between Local Windows for Window-based Vision Transformers

Shanshan Wan, Yingmei Wei, Lai Kang, Beibei Han, Tianrui Shen, Zanxi Ruan
DOI: https://doi.org/10.1109/BigDIA60676.2023.10429754
2023-01-01
Abstract: Window-based self-attention (WSA) has proven to be an effective way to reduce the complexity of a transformer block in window-based transformers. However, the WSA module cannot learn relations between local windows. Some recent transformers therefore create cross-window connections by proposing new attention modules or adding convolutions to the transformer block, yet they still adopt the WSA module with its limited receptive field. To provide global information to all transformer blocks in window-based transformers, we propose a local-window shortcut (LWS) that runs in parallel with the residual shortcut. LWS contains both spatial and channel transformations. The channel transformations are two linear layers, which not only minimize the parameters and computational cost of LWS but also increase feature diversity. The global spatial transformation is achieved through an ACDC layer, which is much lighter than a fully connected layer. Extensive experiments on two public datasets (CIFAR100 and Tiny-ImageNet) and a customized mini-ImageNet dataset demonstrate that LWS with a suitable window size can greatly boost the performance of some SOTA window-based transformers (up to 1.6% accuracy improvement on CIFAR100), while the increase in parameters and FLOPs is negligible.
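The structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify several details, and these are assumptions made here for illustration: the order of operations (channel reduction, then spatial mixing, then channel expansion), the reduced channel width, the absence of biases and nonlinearities, and the use of an orthonormal DCT-II inside the ACDC layer. All function and variable names (`dct_matrix`, `acdc`, `local_window_shortcut`, `w_down`, `w_up`) are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Return the orthonormal DCT-II matrix C of shape (n, n), so C @ C.T == I."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # rescale the DC row for orthonormality
    return c


def acdc(z, a, d):
    """Mix tokens globally: diagonal scale, DCT, diagonal scale, inverse DCT.

    z has shape (num_tokens, channels); a and d have shape (num_tokens,).
    This sketch uses 2 * num_tokens weights, versus num_tokens ** 2 for a
    dense layer over the same token axis.
    """
    c = dct_matrix(z.shape[0])
    z = a[:, None] * z  # A: per-token scaling
    z = c @ z           # C: DCT along the token axis
    z = d[:, None] * z  # D: per-frequency scaling
    return c.T @ z      # inverse DCT (C is orthogonal, so its inverse is C.T)


def local_window_shortcut(x, w_down, w_up, a, d):
    """Assumed LWS layout: linear (channels), ACDC (tokens), linear (channels)."""
    z = x @ w_down      # first linear layer: reduce the channel width
    z = acdc(z, a, d)   # global spatial transformation across all tokens
    return z @ w_up     # second linear layer: restore the channel width


# Example: 16 tokens with 8 channels, reduced to 2 channels inside the shortcut.
rng = np.random.default_rng(0)
num_tokens, channels, reduced = 16, 8, 2
x = rng.standard_normal((num_tokens, channels))
w_down = rng.standard_normal((channels, reduced))
w_up = rng.standard_normal((reduced, channels))
a = rng.standard_normal(num_tokens)
d = rng.standard_normal(num_tokens)

block_out = np.zeros_like(x)  # placeholder for the WSA/MLP branch output
# The shortcut is added in parallel with the usual residual connection.
y = block_out + x + local_window_shortcut(x, w_down, w_up, a, d)
```

Under these assumptions the spatial mixing costs 2 × 16 = 32 weights rather than 16 × 16 = 256 for a dense layer over the tokens, which is the kind of saving the abstract attributes to the ACDC layer.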