PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift

Gaojie Wu, Wei-Shi Zheng, Yutong Lu, Qi Tian
DOI: https://doi.org/10.1109/TPAMI.2023.3265499
2023-04-07
Abstract: Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependency. However, ViT requires a large amount of computing resources to compute the global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires less computing resources (e.g., a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism is proposed to enlarge the receptive field in the ladder self-attention block by modelling diverse local self-attention for each branch and interacting among these branches. Second, the input feature of the ladder self-attention block is split equally along the channel dimension for each branch, which considerably reduces the computational cost in the ladder self-attention block (with nearly 1/3 the amount of parameters and FLOPs), and the outputs of these branches are then combined by a pixel-adaptive fusion. Therefore, the ladder self-attention block with a relatively small number of parameters and FLOPs is capable of modelling long-range interactions. Based on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2M parameters and 1.9G FLOPs, which is comparable to several existing models with more than 20M parameters and 4G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to develop a lightweight Vision Transformer that reduces the demand for computational resources while maintaining high performance. Specifically, it proposes a new architecture called **Progressive Shift Ladder Transformer (PSLT)**.

#### Main Objectives

1. **Reduce Computational Resource Demand**: Design a lightweight transformer with fewer parameters and floating-point operations (FLOPs), so that it can be deployed on devices with limited computational resources.
2. **Expand Receptive Field**: Introduce multi-branch ladder self-attention blocks and a progressive shift mechanism to expand the receptive field, thereby capturing long-range dependencies.
3. **Improve Versatility**: Combine the advantages of Convolutional Neural Networks (CNNs) in the early stages with the capabilities of self-attention mechanisms in the later stages, enabling the model to perform well in various visual tasks.

#### Specific Methods

- **Ladder Self-Attention Block**: Splits the input feature map into equal parts along the channel dimension, one per branch, and performs local self-attention in each branch. This design significantly reduces the number of parameters and the computational cost.
- **Progressive Shift Mechanism**: Passes output features between branches so that pixels within different windows can interact with each other, thereby expanding the receptive field.
- **Pixel-Adaptive Fusion Module**: Fuses the output features of the different branches with adaptive weights, further improving the model's performance.

#### Experimental Results

- On the ImageNet-1k dataset, PSLT achieved a top-1 accuracy of 79.9% with only 9.2M parameters and 1.9G FLOPs, which is comparable to some existing models with over 20M parameters and 4G FLOPs.
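
The saving from the channel split can be checked with a little arithmetic. The sketch below is illustrative only: it counts the weights of one bias-free linear projection, and it assumes three branches, a number chosen here because it matches the "nearly 1/3" figure in the abstract; the excerpt above does not state the branch count. The helper name `projection_params` and the channel width of 192 are hypothetical.

```python
def projection_params(channels: int, branches: int = 1) -> int:
    """Count weights in `branches` independent bias-free linear maps,
    each mapping (channels / branches) -> (channels / branches)."""
    if channels % branches != 0:
        raise ValueError("channels must be divisible by branches")
    per_branch = channels // branches
    return branches * per_branch * per_branch


# Hypothetical width of 192 channels; three branches is an assumption.
full = projection_params(192)         # one 192 -> 192 projection
split = projection_params(192, 3)     # three 64 -> 64 projections
ratio = split / full

print(full, split, ratio)
```

In general, splitting C channels into B branches turns a cost of C² into B·(C/B)² = C²/B for each projection. Components outside this count, such as the fusion module, add their own parameters, which would be consistent with the abstract saying "nearly" rather than exactly 1/3.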
In summary, the paper reduces the computational cost of Vision Transformers with a lightweight architecture that remains competitive on image classification, object detection, and person re-identification.
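
The three components listed under Specific Methods can be put together in a minimal NumPy sketch. This is not the authors' implementation. The following are assumptions made for illustration: three branches, a cyclic roll of `i * shift` pixels for branch `i`, adding the previous branch's output to the next branch's input, random projection weights, and a per-pixel softmax over branches as the fusion. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def window_attention(x, ws, wq, wk, wv):
    """Self-attention inside non-overlapping ws x ws windows. x: (H, W, c)."""
    H, W, c = x.shape
    # Partition into windows: (num_windows, ws * ws, c)
    win = x.reshape(H // ws, ws, W // ws, ws, c).transpose(0, 2, 1, 3, 4)
    win = win.reshape(-1, ws * ws, c)
    q, k, v = win @ wq, win @ wk, win @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c))
    out = attn @ v
    # Undo the window partition: back to (H, W, c)
    out = out.reshape(H // ws, W // ws, ws, ws, c).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, c)


def ladder_block(x, num_branches=3, ws=4, shift=2, seed=0):
    """Simplified ladder block. x: (H, W, C), with H and W divisible by ws
    and C divisible by num_branches. Returns (fused output, fusion weights)."""
    rng = np.random.default_rng(seed)
    C = x.shape[-1]
    c = C // num_branches
    outputs, prev = [], None
    for i, xb in enumerate(np.split(x, num_branches, axis=-1)):
        if prev is not None:
            xb = xb + prev  # assumption: previous output is added to next input
        s = i * shift  # assumption: each branch uses a larger cyclic shift
        xb = np.roll(xb, (-s, -s), axis=(0, 1))
        # Random weights stand in for learned projections.
        wq, wk, wv = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
        yb = window_attention(xb, ws, wq, wk, wv)
        yb = np.roll(yb, (s, s), axis=(0, 1))  # undo the shift
        outputs.append(yb)
        prev = yb
    # Simplified pixel-adaptive fusion: per-pixel softmax weights over branches.
    w_fuse = rng.standard_normal((C, num_branches)) / np.sqrt(C)
    weights = softmax(np.concatenate(outputs, axis=-1) @ w_fuse)  # (H, W, B)
    weighted = np.stack(outputs) * weights.transpose(2, 0, 1)[..., None]
    return np.concatenate(list(weighted), axis=-1), weights


x = np.random.default_rng(1).standard_normal((8, 8, 12))
fused, weights = ladder_block(x)
print(fused.shape, weights.shape)  # (8, 8, 12) (8, 8, 3)
```

Because each branch attends within differently shifted windows and also receives the previous branch's output, a pixel in a later branch can be influenced by pixels outside its own window, which is the intuition behind the enlarged receptive field.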