Abstract: Vision Transformers (ViTs) have recently attracted considerable attention in computer vision and achieved state-of-the-art results on many vision tasks. Nevertheless, owing to their large model size and high computational cost, few transformer-based models are deployed in real-world applications. Because the computational cost of the attention operation grows quadratically with the input size, several compression methods for the Multi-Head Self-Attention (MHSA) module have been proposed; they reduce its FLOPs successfully but barely reduce its parameters. Meanwhile, the Feed-Forward Network (FFN) module has more parameters and a higher computational cost than the MHSA module, yet techniques for compressing it have not been explored in depth. Consequently, we focus on compressing the FFN layer and present a pruning method named Multi-Dimension Compression of Feed-Forward Network in Vision Transformers (MCF), which greatly reduces the model's parameters and computational cost. First, we identify the critical elements in the output of the FFN module and use them to guide the irregular sparsification of this layer, thereby recognizing the insignificant elements of the FFN layer that have little impact on the output. Next, to discard the insignificant elements, we transform the irregular sparsity into regular sparsity and prune them, reducing the model's parameters and obtaining a substantial speed-up during inference. Extensive experiments on ImageNet-1K validate the effectiveness of the proposed method, which achieves significant reductions in parameters and computational cost with almost unimpaired generalization. For example, we compress DeiT-Tiny with a 42% reduction in FLOPs and a 33% reduction in parameters, almost without losing accuracy on the ImageNet dataset. Further, we verify the effectiveness of our method on a downstream task, using the pruned DeiT-Small as the backbone for object detection on the COCO dataset, where it obtains efficiency gains without compromising performance.
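The claim that the FFN module outweighs MHSA in both parameters and computation can be checked with a back-of-the-envelope count. The sketch below is illustrative only and is not the MCF method: it assumes a standard ViT block with an FFN hidden width of four times the embedding dimension, ignores biases, LayerNorm, and softmax, and uses a DeiT-Tiny-like setting (embedding dimension 192, 197 tokens) as an assumed example. The function names and the `keep_ratio` parameter are hypothetical.

```python
def mhsa_cost(n_tokens, dim):
    """Return (params, macs) for one self-attention module, biases ignored."""
    params = 4 * dim * dim                      # Q, K, V and output projections
    proj_macs = 4 * n_tokens * dim * dim        # the four linear projections
    attn_macs = 2 * n_tokens * n_tokens * dim   # QK^T plus attention-weighted V
    return params, proj_macs + attn_macs


def ffn_cost(n_tokens, dim, hidden):
    """Return (params, macs) for a two-layer FFN: dim -> hidden -> dim."""
    params = 2 * dim * hidden
    macs = 2 * n_tokens * dim * hidden
    return params, macs


def pruned_hidden(hidden, keep_ratio):
    """Hidden width left after removing whole hidden units (regular sparsity)."""
    return max(1, int(hidden * keep_ratio))


if __name__ == "__main__":
    n_tokens, dim = 197, 192          # assumed DeiT-Tiny-like setting
    hidden = 4 * dim                  # assumed MLP ratio of 4

    print("MHSA (params, MACs):", mhsa_cost(n_tokens, dim))
    print("FFN  (params, MACs):", ffn_cost(n_tokens, dim, hidden))

    # Removing half of the hidden units halves both FFN parameters and MACs,
    # because each removed unit deletes one row and one column of dense weights.
    kept = pruned_hidden(hidden, keep_ratio=0.5)
    print("FFN after pruning (params, MACs):", ffn_cost(n_tokens, dim, kept))
```

Under these assumptions the FFN holds twice the parameters of MHSA and also costs more multiply-accumulates at this sequence length. This also shows why converting irregular sparsity into regular sparsity matters: only removing whole hidden units shrinks the dense matrices that are actually multiplied at inference time.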
