Abstract: The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain's ability to integrate global and local information for comprehensive visual understanding, we propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. SPM comprises two critical modules: Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE). The MSA module integrates multi-scale features to enrich feature representation, while the GLE module focuses on refining local detail extraction, thus achieving an optimal balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. These results underscore the efficacy of SPM in enhancing model accuracy and robustness across a wide range of computer vision tasks.

What problem does this paper attempt to address?

The problem this paper addresses is how to integrate global and local information more effectively in Vision Transformers, so as to improve performance on dense prediction tasks such as object detection and semantic segmentation. Specifically, the paper proposes Stepwise Patch Merging (SPM), a technique that aims to overcome the limitations of existing patch merging methods in modeling long-range dependencies and preserving spatial information.

### Background of the Main Problem

1. **Limitations of Vision Transformer (ViT)**:
   - Although ViT performs well on tasks such as image classification, it falls behind Convolutional Neural Networks (CNNs) on dense prediction tasks such as object detection and semantic segmentation. This is mainly because ViT lacks the ability to represent hierarchical features.

2. **Deficiencies of Existing Patch Merging Methods**:
   - Fixed-grid methods (such as PVT and Swin) can generate hierarchical feature maps, but their single, relatively small receptive field limits their ability to model geometric transformations.
   - Dynamic feature methods (such as DynamicViT and HAFA) extract features adaptively, but may lose valuable information and usually do not support end-to-end training.

### Proposed Solution

To solve the above problems, the paper proposes the **Stepwise Patch Merging (SPM)** framework, which contains two key modules:

1. **Multi-Scale Aggregation (MSA)**:
   - Expand the receptive field through depthwise convolution with large kernels to capture long-range dependencies.
   - Use Channel Shuffle and linear projection to fuse multi-scale features.
   - The formulas are as follows:

   \[ H_n = \mathrm{DWConv}_{k_n \times k_n}(x_n) \]
   \[ G_c = W_c([H_c^1; H_c^2; \dots; H_c^N]) \]
   \[ \mathrm{MSA}(X) = W([G_1; G_2; \dots; G_{C/N}]) \]

2. **Guided Local Enhancement (GLE)**:
   - Introduce context-aware guide tokens and refine local feature extraction through the self-attention mechanism.
   - Use large convolution kernels to generate the guide tokens, so that local features are both accurate and context-aware.
   - The formulas are as follows:

   \[ \mathrm{GTG}(X) = \mathrm{DWConv}(\mathrm{GELU}(\mathrm{BatchNorm}(X))) \]
   \[ z = [S(i,j) \sim \mathrm{GTG}(X);\ S_1(i,j) \sim \rho(i,j);\ \dots;\ S_{k^2}(i,j) \sim \rho(i,j)] \]
   \[ [q, k, v] = z U_{qkv} \]
   \[ \mathrm{SA}(z) = \mathrm{softmax}\left(\frac{q k^{\top}}{\sqrt{C}}\right) v \]
   \[ \mathrm{GLE}(X(i,j)) = A(i,j) \sim \mathrm{GTG}(X) \]

### Experimental Results

- **Image Classification**: Experiments on the ImageNet-1K dataset show that SPM significantly improves the classification accuracy of models of different scales.
- **Object Detection**: Experiments on the COCO dataset report improvements of 4.1%, 2.6%, and 1.3% from SPM on object detection and instance segmentation.
- **Semantic Segmentation**: Experiments on the ADE20K dataset show that SPM brings significant improvements in semantic segmentation, especially for small and large objects.
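The three MSA formulas in the summary above can be read as a split, filter, regroup, and project pipeline. The following is a minimal pure-Python sketch of one possible reading, not the paper's implementation. It makes several assumptions: the C input channels are split into N equal groups, group n is filtered with a k_n × k_n depthwise kernel, each W_c maps N channels to N channels, and W is a plain channel-mixing matrix. The function names (`depthwise_conv2d`, `box_kernel`, `project`, `multi_scale_aggregation`) and the averaging kernel are hypothetical placeholders standing in for learned weights.

```python
def depthwise_conv2d(channel, kernel):
    """Zero-padded 'same' 2D correlation of a single channel with a k x k kernel."""
    k = len(kernel)
    pad = k // 2
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * channel[y][x]
            out[i][j] = acc
    return out


def box_kernel(k):
    """Placeholder averaging kernel (assumption: stands in for learned weights)."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def project(maps, matrix):
    """Pointwise linear projection: each output map is a weighted sum of input maps."""
    h, w = len(maps[0]), len(maps[0][0])
    return [
        [[sum(wt * m[i][j] for wt, m in zip(row, maps)) for j in range(w)]
         for i in range(h)]
        for row in matrix
    ]


def multi_scale_aggregation(x, kernel_sizes, w_c, w_out):
    n = len(kernel_sizes)
    per_group = len(x) // n  # C / N channels per group
    # H_n = DWConv_{k_n x k_n}(x_n): one kernel size per channel group
    h_groups = []
    for g, k in enumerate(kernel_sizes):
        group = x[g * per_group:(g + 1) * per_group]
        h_groups.append([depthwise_conv2d(ch, box_kernel(k)) for ch in group])
    # G_c = W_c([H_c^1; ...; H_c^N]): gather channel c from every scale, then fuse
    fused = []
    for c in range(per_group):
        same_index = [h_groups[g][c] for g in range(n)]
        fused.extend(project(same_index, w_c[c]))
    # MSA(X) = W([G_1; ...; G_{C/N}])
    return project(fused, w_out)


# Toy usage: C = 4 channels of ones on a 5 x 5 grid, N = 2 scales (1x1 and 3x3).
ones = [[1.0] * 5 for _ in range(5)]
x = [ones, ones, ones, ones]
eye2 = [[1.0, 0.0], [0.0, 1.0]]
eye4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
out = multi_scale_aggregation(x, [1, 3], [eye2, eye2], eye4)
```

With identity projections, output channel 0 comes from the 1 × 1 branch and output channel 1 from the 3 × 3 branch, which shows how the regrouping step interleaves scales before the final projection.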
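The attention step of GLE, as given in the formulas above, can be illustrated with a small pure-Python sketch. It assumes the guide token for a position has already been produced (the GTG step is not modeled here), that the sequence z is the guide token followed by the k² local tokens, and that the enhanced feature is the attention output row at the guide token's position. That last reading of the final formula is an assumption. The function names and the identity projections used for U_qkv are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply matrix a (rows x inner) by matrix b (inner x cols)."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(z, u_q, u_k, u_v):
    """SA(z) = softmax(q k^T / sqrt(C)) v for a token matrix z with C channels."""
    c = len(z[0])
    q, k, v = matmul(z, u_q), matmul(z, u_k), matmul(z, u_v)
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(c) for kj in k]
              for qi in q]
    attn = [softmax(row) for row in scores]
    return matmul(attn, v), attn


def guided_local_enhancement(guide_token, local_tokens, u_q, u_k, u_v):
    """Attend over [guide; local tokens] and return the guide token's output row."""
    z = [guide_token] + local_tokens
    out, attn = self_attention(z, u_q, u_k, u_v)
    return out[0], attn[0]


# Toy usage: C = 2 channels, a 2 x 2 local window (k^2 = 4 tokens).
identity = [[1.0, 0.0], [0.0, 1.0]]
guide = [1.0, 0.0]
window = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
enhanced, weights = guided_local_enhancement(guide, window, identity, identity, identity)
```

In this toy case the guide token places more weight on the two window tokens that match it than on the two that do not, so the enhanced feature is pulled toward the context the guide token encodes.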

Brain-Inspired Stepwise Patch Merging for Vision Transformers
