Abstract: Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependencies. However, ViT requires a large amount of computing resources to compute global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires less computing resources (e.g. a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism enlarges the receptive field of the ladder self-attention block by modelling diverse local self-attention in each branch and letting the branches interact. Second, the input feature of the ladder self-attention block is split equally along the channel dimension across the branches, which considerably reduces the computational cost of the block (to nearly 1/3 of the parameters and FLOPs), and the outputs of the branches are then combined by a pixel-adaptive fusion. Therefore, the ladder self-attention block is capable of modelling long-range interactions with a relatively small number of parameters and FLOPs. Based on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2M parameters and 1.9G FLOPs, which is comparable to several existing models with more than 20M parameters and 4G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
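
The "nearly 1/3" figure in the abstract is consistent with simple weight counting for a channel split. The sketch below is illustrative only: the branch count of 3 and the channel width of 384 are assumptions (the abstract states neither), `projection_params` is a hypothetical helper, and only bias-free square projection weights are counted, so the fusion module and any other layers are ignored.

```python
def projection_params(channels: int, branches: int = 1) -> int:
    """Weights in `branches` independent bias-free linear maps of size (C/B) x (C/B)."""
    if channels % branches != 0:
        raise ValueError("channels must divide evenly across branches")
    per_branch = channels // branches
    return branches * per_branch * per_branch


# Assumed sizes, chosen only for illustration.
full = projection_params(384)        # one projection over all 384 channels
split = projection_params(384, 3)    # three branches of 128 channels each
ratio = split / full                 # B * (C/B)^2 / C^2 = 1/B

print(full, split, round(ratio, 4))
```

In general, splitting C channels across B branches shrinks a C x C projection to B blocks of (C/B) x (C/B), i.e. a factor of 1/B, which matches the abstract's figure if B = 3.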

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to develop a lightweight Vision Transformer to reduce the demand for computational resources while maintaining high performance. Specifically, the paper proposes a new architecture called **Progressive Shift Ladder Transformer (PSLT)**.

#### Main Objectives

1. **Reduce Computational Resource Demand**: By designing a lightweight transformer with fewer parameters and floating-point operations (FLOPs), making it deployable on devices with limited computational resources.
2. **Expand Receptive Field**: By introducing multi-branch ladder self-attention blocks and a progressive shift mechanism to expand the receptive field, thereby capturing long-range dependencies.
3. **Improve Versatility**: Combining the advantages of Convolutional Neural Networks (CNNs) in the early stages and the capabilities of self-attention mechanisms in the later stages, enabling the model to perform well in various visual tasks.

#### Specific Methods

- **Ladder Self-Attention Block**: Divides the input feature map into multiple equal parts and performs local self-attention calculations in each branch. This design significantly reduces the number of parameters and computational cost.
- **Progressive Shift Mechanism**: By passing output features between different branches, pixels within different windows can interact with each other, thereby expanding the receptive field.
- **Pixel-Adaptive Fusion Module**: Fuses the output features of different branches with adaptive weights, further improving the model's performance.

#### Experimental Results

- On the ImageNet-1k dataset, PSLT achieved a Top-1 accuracy of 79.9% with only 9.2M parameters and 1.9G FLOPs, which is comparable to some existing models with over 20M parameters and 4G FLOPs.
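
The three components listed under Specific Methods can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the window size, the per-branch shift schedule, the way one branch's output is added to the next branch's input, and the form of the fusion (a per-pixel softmax over branches, applied before concatenation) are all assumptions made for illustration, and learned query/key/value projections are omitted.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def window_attention(x, win):
    """Self-attention restricted to non-overlapping win x win windows (no learned weights)."""
    H, W, C = x.shape
    # (H, W, C) -> (num_windows, win*win, C)
    t = x.reshape(H // win, win, W // win, win, C).transpose(0, 2, 1, 3, 4)
    t = t.reshape(-1, win * win, C)
    attn = softmax(t @ t.transpose(0, 2, 1) / np.sqrt(C))
    out = attn @ t
    # (num_windows, win*win, C) -> (H, W, C)
    out = out.reshape(H // win, W // win, win, win, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C)


def ladder_block(x, branches=3, win=4, shift=2, fuse_w=None):
    """Hypothetical ladder block: channel split, shifted local attention, per-pixel fusion."""
    H, W, C = x.shape
    chunks = np.split(x, branches, axis=-1)       # equal split along channels
    outs, prev = [], 0.0
    for i, chunk in enumerate(chunks):
        inp = chunk + prev                        # assumed: previous branch output feeds the next
        s = i * shift                             # assumed: shift grows with the branch index
        rolled = np.roll(inp, (-s, -s), axis=(0, 1))
        out = np.roll(window_attention(rolled, win), (s, s), axis=(0, 1))
        outs.append(out)
        prev = out
    if fuse_w is None:
        fuse_w = np.zeros((C, branches))          # zero weights give uniform fusion
    scores = np.concatenate(outs, axis=-1) @ fuse_w   # (H, W, branches)
    weights = softmax(scores, axis=-1)                # one weight per pixel per branch
    fused = np.stack(outs, axis=2) * weights[..., None]
    return fused.reshape(H, W, C), weights


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 12))
y, w = ladder_block(x)
print(y.shape, w.shape)
```

Because each branch attends within a differently shifted window grid and receives the previous branch's output, a pixel in the last branch can be influenced by pixels outside its own window, which is the receptive-field effect the summary describes.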
In summary, the paper addresses the high computational cost of conventional Vision Transformers by proposing a lightweight architecture, PSLT, that remains competitive on image classification, object detection and person re-identification.

PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift

ViT-LSLA: Vision Transformer with Light Self-Limited-Attention

Lite Vision Transformer with Enhanced Self-Attention

Light-Weight Vision Transformer with Parallel Local and Global Self-Attention

Vision Transformer with Progressive Sampling

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Vision Transformer with Sparse Scan Prior

Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Fast Vision Transformers with HiLo Attention

FViT: A Focal Vision Transformer with Gabor Filter

SepViT: Separable Vision Transformer

LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution

Vision Transformer with Super Token Sampling

PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention

SimViT: Exploring a Simple Vision Transformer with sliding windows

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Vision Transformers: From Semantic Segmentation to Dense Prediction

PPT: Token Pruning and Pooling for Efficient Vision Transformers

You Only Need Less Attention at Each Stage in Vision Transformers

DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

Super Vision Transformer