Brain-Inspired Stepwise Patch Merging for Vision Transformers

Yonghao Yu, Dongcheng Zhao, Guobin Shen, Yiting Dong, Yi Zeng
2024-09-11
Abstract: The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain's ability to integrate global and local information for comprehensive visual understanding, we propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. SPM comprises two critical modules: Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE). The MSA module integrates multi-scale features to enrich feature representation, while the GLE module focuses on refining local detail extraction, thus achieving an optimal balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. These results underscore the efficacy of SPM in enhancing model accuracy and robustness across a wide range of computer vision tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to integrate global and local information more effectively in Vision Transformers, so as to improve performance on dense prediction tasks such as object detection and semantic segmentation. Specifically, it proposes a new technique, Stepwise Patch Merging (SPM), which aims to overcome the limitations of existing methods in modeling long-range dependencies and preserving spatial information.

### Background of the Main Problem

1. **Limitations of Vision Transformer (ViT)**:
   - Although ViT performs well in tasks such as image classification, its performance in dense prediction tasks (such as object detection and semantic segmentation) falls short of Convolutional Neural Networks (CNNs). This is mainly because ViT lacks the ability to represent hierarchical features.
2. **Deficiencies of Existing Patch Merging Methods**:
   - Fixed-grid methods (such as PVT and Swin) can generate hierarchical feature maps, but their single and relatively small receptive fields limit their ability to model geometric transformations.
   - Dynamic feature methods (such as DynamicViT, HAFA) adaptively extract features, but may lose valuable information and usually do not support end-to-end training.

### Proposed Solution

To solve the above problems, the paper proposes the **Stepwise Patch Merging (SPM)** framework, which contains two key modules:

1. **Multi-Scale Aggregation (MSA)**:
   - Expand the receptive field through depthwise convolution (DWConv) with large convolution kernels to capture long-range dependencies.
   - Use Channel Shuffle and linear projection to fuse multi-scale features.
   - The formulas are as follows:

   \[ H_n = \mathrm{DWConv}_{k_n \times k_n}(x_n) \]

   \[ G_c = W_c([H_c^1; H_c^2; \ldots; H_c^N]) \]

   \[ \mathrm{MSA}(X) = W([G_1; G_2; \ldots; G_{C/N}]) \]

2. **Guided Local Enhancement (GLE)**:
   - Introduce context-aware guide tokens, and optimize local feature extraction through the self-attention mechanism.
   - Use large convolution kernels to generate guide tokens, so that local features are both accurate and context-related.
   - The formulas are as follows:

   \[ \mathrm{GTG}(X) = \mathrm{DWConv}(\mathrm{GELU}(\mathrm{BatchNorm}(X))) \]

   \[ z = [S(i,j) \sim \mathrm{GTG}(X);\; S_1(i,j) \sim \rho(i,j);\; \ldots;\; S_{k^2}(i,j) \sim \rho(i,j)] \]

   \[ [q, k, v] = z U_{qkv} \]

   \[ \mathrm{SA}(z) = \mathrm{softmax}\left(\frac{q k^{\top}}{\sqrt{C}}\right) v \]

   \[ \mathrm{GLE}(X(i,j)) = A(i,j) \sim \mathrm{GTG}(X) \]

### Experimental Results

- **Image Classification**: Experiments on the ImageNet-1K dataset show that SPM significantly improves the classification accuracy of models of different scales.
- **Object Detection**: Experiments on the COCO dataset show that SPM yields improvements of 4.1%, 2.6%, and 1.3% on object detection and instance segmentation.
- **Semantic Segmentation**: Experiments on the ADE20K dataset show that SPM brings significant improvements in semantic segmentation, especially for small and large objects.
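
To make the three MSA equations in the Proposed Solution section more concrete, here is a minimal NumPy sketch of one possible reading of them; it is not the authors' implementation. It assumes that the input X with C channels is split into N equal channel groups x_n, that each group gets its own depthwise kernel size k_n, that Channel Shuffle gathers channel c of every group into one set of N maps, and that each W_c is an N×N mixing matrix. The function names `depthwise_conv2d` and `msa`, the argument layout, and all weight shapes are hypothetical, and the spatial downsampling that patch merging performs is left out.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Depthwise 2-D convolution with zero 'same' padding (stride 1).

    x:       (C, H, W) feature map
    kernels: (C, k, k), one odd-sized filter per channel
    """
    C, H, W = x.shape
    k = kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x, dtype=float)
    for di in range(k):
        for dj in range(k):
            # Each channel is filtered only by its own kernel: no channel mixing.
            out += kernels[:, di, dj, None, None] * xp[:, di:di + H, dj:dj + W]
    return out

def msa(x, head_kernels, group_weights, out_weight):
    """Sketch of MSA(X) = W([G_1; ...; G_{C/N}]) under the stated assumptions.

    x:             (C, H, W) input
    head_kernels:  list of N arrays, the n-th of shape (C/N, k_n, k_n)
    group_weights: (C/N, N, N), assumed shape of W_c for each group c
    out_weight:    (C_out, C), the final projection W
    """
    N = len(head_kernels)
    C, H, W = x.shape
    assert C % N == 0, "channels must split evenly into N heads"

    # H_n = DWConv_{k_n x k_n}(x_n): one kernel size per head.
    heads = np.split(x, N, axis=0)
    Hn = np.stack([depthwise_conv2d(xn, kn)
                   for xn, kn in zip(heads, head_kernels)])   # (N, C/N, H, W)

    # Channel shuffle: group c collects channel c from every head.
    shuffled = Hn.transpose(1, 0, 2, 3)                       # (C/N, N, H, W)

    # G_c = W_c([H_c^1; ...; H_c^N]): mix the N scales inside each group.
    G = np.einsum('cmn,cnhw->cmhw', group_weights, shuffled)  # (C/N, N, H, W)

    # Concatenate all groups and apply the final projection W.
    G = G.reshape(C, H, W)
    return np.einsum('oc,chw->ohw', out_weight, G)

# Usage with random weights: 4 channels, 2 heads with 3x3 and 5x5 kernels.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
kernels = [rng.standard_normal((2, 3, 3)), rng.standard_normal((2, 5, 5))]
y = msa(x, kernels, rng.standard_normal((2, 2, 2)), rng.standard_normal((4, 4)))
print(y.shape)  # (4, 8, 8)
```

In this reading the shuffle is what lets every G_c see all N kernel sizes at once, so the final projection W operates on features that already mix scales. A practical version would more likely rely on a framework's grouped convolution than on the explicit loops used here for clarity.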