Abstract: Vision Transformers (ViTs) have recently made a splash in the computer vision domain and achieved state-of-the-art results in many vision tasks. Nevertheless, due to their vast model size and high computational cost, few transformer-based models are adopted in real-world applications. Since the computational cost of the attention operation is quadratic in the input size, several compression methods have been proposed for the Multi-Head Self-Attention (MHSA) module; they reduce its FLOPs successfully but leave its parameter count almost unchanged. Meanwhile, the parameter count and computational cost of the Feed-Forward Network (FFN) module exceed those of the MHSA module, yet compression techniques for the FFN have not been explored in depth. Consequently, we focus on compressing the FFN layer and present a pruning method named Multi-Dimension Compression of Feed-Forward Network in Vision Transformers (MCF), which greatly reduces the model's parameters and computational cost. First, we identify the critical elements in the output of the FFN module and use them to guide irregular sparsification of this layer, recognizing the insignificant elements of the FFN layer that have less impact on the output. Next, to discard the insignificant elements, we transform the irregular sparsity into regular sparsity and prune them, thus reducing the model's parameters and obtaining a substantial speed-up during inference. Extensive results on ImageNet-1K validate the effectiveness of the proposed method, which obtains significant reductions in parameters and computational cost with almost unimpaired generalization. For example, we compress DeiT-Tiny with a 42% reduction in FLOPs and a 33% reduction in parameters, almost without losing accuracy on the ImageNet dataset. Further, we verify the effectiveness of our method on a downstream task, using the pruned DeiT-Small as the backbone for object detection on the COCO dataset, where the compression gains are retained without compromising its performance.
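The regular (structured) pruning step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the MCF algorithm itself: the abstract does not specify the importance criterion, so the `hidden_importance` score below (mean absolute activation times the norm of the outgoing weights) is a hypothetical stand-in, and the sizes (192-dimensional tokens, 768 hidden units, 197 tokens) are assumed DeiT-Tiny-like values with random weights. The sketch only shows why removing whole hidden units shrinks both weight matrices of the FFN.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; note that gelu(0) == 0
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, w1, b1, w2, b2):
    # Two-layer feed-forward block: expand, activate, project back
    return gelu(x @ w1 + b1) @ w2 + b2

def hidden_importance(x, w1, b1, w2):
    # HYPOTHETICAL proxy score per hidden unit (an assumption, not the
    # paper's criterion): average activation magnitude on sample tokens,
    # weighted by the L2 norm of that unit's outgoing weights.
    h = gelu(x @ w1 + b1)                      # shape (tokens, hidden)
    return np.abs(h).mean(axis=0) * np.linalg.norm(w2, axis=1)

def prune_ffn(w1, b1, w2, scores, keep_ratio):
    # Structured pruning: keep the top-scoring hidden units and drop the
    # matching columns of w1, entries of b1, and rows of w2.
    n_keep = max(1, int(round(keep_ratio * w1.shape[1])))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return w1[:, keep], b1[keep], w2[keep, :], keep

# Assumed DeiT-Tiny-like sizes with random weights, for illustration only
rng = np.random.default_rng(0)
d, hidden, tokens = 192, 768, 197
x = rng.standard_normal((tokens, d))
w1 = rng.standard_normal((d, hidden)) * 0.02
b1 = np.zeros(hidden)
w2 = rng.standard_normal((hidden, d)) * 0.02
b2 = np.zeros(d)

scores = hidden_importance(x, w1, b1, w2)
pw1, pb1, pw2, keep = prune_ffn(w1, b1, w2, scores, keep_ratio=0.5)

params_before = w1.size + b1.size + w2.size + b2.size
params_after = pw1.size + pb1.size + pw2.size + b2.size
print(f"FFN parameters: {params_before} -> {params_after}")

# The pruned block equals the full block with the dropped units masked out
mask = np.zeros(hidden)
mask[keep] = 1.0
masked_out = (gelu(x @ w1 + b1) * mask) @ w2 + b2
pruned_out = ffn(x, pw1, pb1, pw2, b2)
print("pruned == masked full FFN:", np.allclose(masked_out, pruned_out))
```

Because whole hidden units are removed, the pruned FFN is an ordinary dense layer pair of smaller width, which is why regular sparsity translates into fewer parameters and faster inference without needing sparse kernels.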

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers
