Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Yang He, Joey Tianyi Zhou

2024-04-21

Abstract: Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the "local" attention weights are compared at a "global" level, which may cause some "locally" important weights to be pruned due to their relatively small magnitude "globally". The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels.
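The first drawback can be illustrated with a minimal sketch of global magnitude pruning. The two module names, their weight scales, and the 50% sparsity target are assumptions chosen for illustration and are not taken from the paper; the point is only that a single global threshold on |w| tends to empty out whichever module has the smaller weight scale.

```python
import random


def global_magnitude_mask(modules, sparsity):
    """Prune the smallest-|w| weights across all modules with one global threshold.

    modules: dict mapping a module name to a list of float weights.
    Returns a dict mapping each module name to a list of bools (True = keep).
    """
    all_mags = sorted(abs(w) for ws in modules.values() for w in ws)
    n_prune = int(len(all_mags) * sparsity)
    # Every weight whose magnitude falls below this value is removed.
    threshold = all_mags[n_prune] if n_prune < len(all_mags) else float("inf")
    return {name: [abs(w) >= threshold for w in ws] for name, ws in modules.items()}


rng = random.Random(0)
# Two hypothetical modules whose weights live on very different scales.
modules = {
    "attn_small_scale": [rng.gauss(0.0, 0.01) for _ in range(1000)],
    "mlp_large_scale": [rng.gauss(0.0, 1.0) for _ in range(1000)],
}

mask = global_magnitude_mask(modules, sparsity=0.5)
kept = {name: sum(m) / len(m) for name, m in mask.items()}
print(kept)  # fraction of weights kept per module
```

In this toy setting nearly all of the pruning budget lands on the small-scale module, even though its weights may matter just as much within their own module.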

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

The paper attempts to address the issue that existing pruning methods fail to fully utilize the unique properties of Hierarchical Vision Transformers (ViTs), resulting in suboptimal pruning performance. Specifically, existing methods mainly rely on the absolute value of weights for pruning, which presents two major problems:

1. **Local vs. Global Comparison Issue**: Existing pruning methods compare local attention weights on a global scale, which may lead to some locally important weights being incorrectly pruned due to their lower global importance.
2. **Ignoring Weight Distribution Differences**: The weight distribution at different levels is crucial for extracting coarse-to-fine-grained features, but existing pruning methods fail to consider this.

To overcome these issues, the authors propose a Data-independent Module-Aware Pruning (DIMAP) method specifically designed for compressing Hierarchical Vision Transformers. This method treats the weights at different levels as a module and evaluates the importance of weights by analyzing their information distortion, ensuring a fair comparison of weight contributions across different levels. Additionally, DIMAP introduces a new weight importance measurement method that is based solely on the weights themselves, independent of input images, thus eliminating the dependency on the patch merging process.

Experiments conducted on Swin Transformers of different sizes show that DIMAP can significantly reduce computational and parameter costs while maintaining or even improving model performance. For example, after removing 52.5% of FLOPs and 52.7% of parameters, the top-5 accuracy of Swin-B only decreased by 0.07%; while reducing 33.2% of FLOPs and 33.2% of parameters, the relative top-5 accuracy of Swin-S even increased by 0.8%. These results validate the effectiveness and advantages of DIMAP.
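The idea of a weight-only, module-aware score can be sketched as follows. This is not the DIMAP formula, which the summary does not give; as a simplified stand-in, each weight is scored by its share of its own module's squared norm, so the score depends only on the weights and is comparable across modules with different scales. The module names, scales, and sparsity target are hypothetical.

```python
import random


def module_normalized_scores(weights):
    """Score each weight by its share of the module's total squared norm."""
    total = sum(w * w for w in weights)
    if total == 0.0:
        return [0.0 for _ in weights]
    return [w * w / total for w in weights]


def module_aware_mask(modules, sparsity):
    """Rank module-normalized scores jointly and prune the lowest fraction.

    No input data is used: scores are computed from the weights alone.
    """
    scores = {name: module_normalized_scores(ws) for name, ws in modules.items()}
    flat = sorted(s for ss in scores.values() for s in ss)
    n_prune = int(len(flat) * sparsity)
    threshold = flat[n_prune] if n_prune < len(flat) else float("inf")
    return {name: [s >= threshold for s in ss] for name, ss in scores.items()}


rng = random.Random(0)
# Hypothetical modules with the same shape of distribution but different scales.
toy_modules = {
    "stage1_attn": [rng.gauss(0.0, 0.01) for _ in range(1000)],
    "stage4_attn": [rng.gauss(0.0, 1.0) for _ in range(1000)],
}

aware_mask = module_aware_mask(toy_modules, sparsity=0.5)
kept_aware = {name: sum(m) / len(m) for name, m in aware_mask.items()}
print(kept_aware)  # fraction of weights kept per module
```

Because the scores are normalized within each module, the pruning budget in this toy example is shared roughly evenly between the two modules instead of being absorbed by the one with smaller weights.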

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer

Isomorphic Pruning for Vision Models

An Attention-Based Token Pruning Method for Vision Transformers

Width & Depth Pruning for Vision Transformers

Pruning Self-attentions into Convolutional Layers in Single Path

Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

A unified pruning framework for vision transformers

Attention Map Guided Transformer Pruning for Edge Device

LPViT: Low-Power Semi-structured Pruning for Vision Transformers

Vision Transformers with Hierarchical Attention

Multi-Scale And Token Mergence: Make Your ViT More Efficient

Rethinking Hierarchies in Pre-trained Plain Vision Transformer

CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction

What Makes for Hierarchical Vision Transformer?

Vision Transformer Pruning Via Matrix Decomposition

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation