The Need for Speed: Pruning Transformers with One Recipe

Samir Khaki, Konstantinos N. Plataniotis

2024-03-27

Abstract: We introduce the $\textbf{O}$ne-shot $\textbf{P}$runing $\textbf{T}$echnique for $\textbf{I}$nterchangeable $\textbf{N}$etworks ($\textbf{OPTIN}$) framework as a tool to increase the efficiency of pre-trained transformer architectures $\textit{without requiring re-training}$. Recent works have explored improving transformer efficiency, but they often incur computationally expensive re-training procedures or depend on architecture-specific characteristics, which impedes practical wide-scale adoption. To address these shortcomings, the OPTIN framework leverages intermediate feature distillation, capturing the long-range dependencies of model parameters (coined $\textit{trajectory}$), to produce state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks $\textit{without re-training}$. Given a FLOP constraint, the OPTIN framework compresses the network while maintaining competitive accuracy and improving throughput. In particular, we show a $\leq 2$% accuracy degradation from NLP baselines and a $0.5$% improvement over state-of-the-art methods on image classification at competitive FLOPs reductions. We further demonstrate generalization across tasks and architectures, with comparable performance, using Mask2Former for semantic segmentation and CNN-style networks. OPTIN presents one of the first one-shot efficient frameworks for compressing transformer architectures that generalizes well across domains, in particular natural language and image-related tasks, $\textit{without re-training}$.
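The abstract's idea of scoring parameters by their effect on later features can be illustrated with a toy example. The sketch below is not the OPTIN implementation: it assumes a small ReLU MLP in NumPy, and the helper names `forward` and `trajectory_scores` are hypothetical. Each first-layer neuron is masked in turn, and the resulting mean squared change in all downstream features stands in for a feature-distillation loss.

```python
import numpy as np

def forward(x, weights, mask=None):
    """Run a toy ReLU MLP and return the features of every layer.

    mask, if given, is a 0/1 vector applied to the first layer's output.
    """
    feats = []
    h = x
    for i, w in enumerate(weights):
        h = np.maximum(h @ w, 0.0)
        if i == 0 and mask is not None:
            h = h * mask
        feats.append(h)
    return feats

def trajectory_scores(x, weights):
    """Score each first-layer neuron by how far masking it moves the
    features of all later layers (assumed stand-in for a distillation loss)."""
    base = forward(x, weights)
    n_units = weights[0].shape[1]
    scores = np.zeros(n_units)
    for j in range(n_units):
        mask = np.ones(n_units)
        mask[j] = 0.0
        pruned = forward(x, weights, mask)
        # Accumulate the mean squared feature error over downstream layers only.
        scores[j] = sum(np.mean((b - p) ** 2) for b, p in zip(base[1:], pruned[1:]))
    return scores

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 16)), rng.normal(size=(16, 4))]
# Make neuron 3 irrelevant by zeroing its outgoing weights.
weights[1][3, :] = 0.0
x = rng.normal(size=(32, 8))
scores = trajectory_scores(x, weights)
```

Because neuron 3 has no outgoing connections, masking it leaves every later feature unchanged and its score is zero, so a one-shot pruner using this score would remove it first without any re-training.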

Machine Learning

What problem does this paper attempt to address?

The paper addresses the efficient compression of pre-trained transformer architectures (such as those used in natural language processing and image classification) across different domains without retraining. Specifically, it proposes a framework called OPTIN (One-shot Pruning Technique for Interchangeable Networks) that reduces the number of floating-point operations (FLOPs) while keeping accuracy competitive, thereby improving model efficiency. The main issues the paper attempts to solve are as follows:

1. **Model Compression**: Current transformer models, while powerful, are computationally expensive, especially in resource-constrained environments such as edge devices. A general method is therefore needed to compress these models and reduce their computational demands.
2. **No Retraining Required**: Many existing compression methods require retraining, which consumes considerable time and compute and is impractical for models that have already undergone expensive training. The paper instead proposes a one-shot compression method that needs no retraining.
3. **Cross-domain Generality**: Existing compression methods are often tailored to specific tasks or architectures and lack broad applicability. The paper aims to develop a general compression framework that applies to various tasks and architectures.
4. **Competitive Performance**: Despite the reduction in computational load, the compressed model must maintain accuracy comparable to the baseline. The paper reports an accuracy degradation of at most 2% from NLP baselines and a 0.5% improvement over state-of-the-art methods on image classification at competitive FLOPs reductions.

With these properties, the OPTIN framework compresses models across different tasks and architectures, and its results on natural language, image classification, transfer learning, and semantic segmentation benchmarks support its effectiveness and broad applicability.
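Compressing a network to meet a FLOP constraint can be sketched as a budgeted selection problem. The example below is a minimal illustration under stated assumptions, not the paper's search procedure: the `Unit` record, the `select_units_to_prune` helper, and the importance values are all hypothetical, and the selection rule is a plain greedy heuristic that removes the units with the lowest importance per FLOP saved.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    name: str          # hypothetical identifier, e.g. an attention head
    importance: float  # assumed precomputed score; higher means more harmful to remove
    flops: float       # FLOPs saved if this unit is removed

def select_units_to_prune(units, total_flops, target_reduction):
    """Greedily pick units with the lowest importance per FLOP saved until
    the requested fraction of total FLOPs has been removed."""
    budget = total_flops * target_reduction
    saved = 0.0
    pruned = []
    for u in sorted(units, key=lambda u: u.importance / u.flops):
        if saved >= budget:
            break
        pruned.append(u.name)
        saved += u.flops
    return pruned, saved

units = [
    Unit("head0", 0.9, 10.0),
    Unit("head1", 0.1, 10.0),
    Unit("head2", 0.5, 10.0),
    Unit("head3", 0.2, 10.0),
]
# Ask for a 50% FLOP reduction on a 40-FLOP toy model.
pruned, saved = select_units_to_prune(units, total_flops=40.0, target_reduction=0.5)
print(pruned, saved)  # ['head1', 'head3'] 20.0
```

The selection runs once over fixed scores, which is what makes the procedure one-shot: no weights are updated between choosing the units and deploying the smaller model.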

A Fast Post-Training Pruning Framework for Transformers

Pruning by Training: A Novel Deep Neural Network Compression Framework for Image Processing

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Only Train Once: A One-Shot Neural Network Training And Pruning Framework

One-Shot Pruning for Fast-adapting Pre-trained Models on Devices

Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads

OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge

HESSO: Towards Automatic Efficient and User Friendly Any Neural Network Training and Pruning

Fitop-Trans: Maximizing Transformer Pipeline Efficiency Through Fixed-Length Token Pruning on FPGA

OTOv3: Automatic Architecture-Agnostic Neural Network Training and Compression from Structured Pruning to Erasing Operators

OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization

OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration

CPOT: Channel Pruning via Optimal Transport

SNP: Structured Neuron-level Pruning to Preserve Attention Scores

Structured Term Pruning for Computational Efficient Neural Networks Inference

Practical Network Acceleration with Tiny Sets: Hypothesis, Theory, and Algorithm