The Need for Speed: Pruning Transformers with One Recipe

Samir Khaki, Konstantinos N. Plataniotis
2024-03-27
Abstract: We introduce the $\textbf{O}$ne-shot $\textbf{P}$runing $\textbf{T}$echnique for $\textbf{I}$nterchangeable $\textbf{N}$etworks ($\textbf{OPTIN}$) framework as a tool to increase the efficiency of pre-trained transformer architectures $\textit{without requiring re-training}$. Recent works have explored improving transformer efficiency, but they often incur computationally expensive re-training procedures or depend on architecture-specific characteristics, impeding practical wide-scale adoption. To address these shortcomings, the OPTIN framework leverages intermediate feature distillation, capturing the long-range dependencies of model parameters (coined $\textit{trajectory}$), to produce state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks $\textit{without re-training}$. Given a FLOP constraint, the OPTIN framework compresses the network while maintaining competitive accuracy and improved throughput. In particular, we show a $\leq 2\%$ accuracy degradation from NLP baselines and a $0.5\%$ improvement over state-of-the-art methods on image classification at competitive FLOP reductions. We further demonstrate generalization across tasks and architectures, with comparable performance using Mask2Former for semantic segmentation and CNN-style networks. OPTIN presents one of the first one-shot efficient frameworks for compressing transformer architectures that generalizes well across different class domains, in particular natural language and image-related tasks, without $\textit{re-training}$.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the efficient compression of pre-trained transformer architectures (such as those used in natural language processing and image classification) across different domains without retraining. Specifically, it proposes a framework called OPTIN (One-shot Pruning Technique for Interchangeable Networks) that reduces the number of floating-point operations (FLOPs) while keeping accuracy competitive with the uncompressed baseline, thereby improving model efficiency. The main issues the paper attempts to solve are as follows:

1. **Model Compression**: Current transformer models, while powerful, are computationally expensive, especially in resource-constrained environments such as edge devices. A general method is therefore needed to compress these models and reduce their computational demands.
2. **No Retraining Required**: Many existing compression methods require retraining, which consumes substantial time and compute and is impractical for models that have already undergone expensive training. The paper proposes a one-shot method that compresses the model without retraining.
3. **Cross-domain Generality**: Existing compression methods are often tailored to specific tasks or architectures and lack broad applicability. The paper aims to develop a general compression framework that applies to various tasks and architectures.
4. **Competitive Performance**: Despite the reduced computational load, the compressed model must maintain performance comparable to the baseline. The paper reports an accuracy degradation of at most 2% from NLP baselines and a 0.5% improvement over state-of-the-art methods on image classification.

With these properties, the OPTIN framework compresses models across different tasks and architectures, and its results on natural language, image classification, transfer learning, and semantic segmentation benchmarks support its effectiveness and broad applicability.
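
As a rough illustration of the idea described above (score parameters by how much removing them perturbs downstream intermediate features, then prune to a FLOP budget with no re-training), here is a minimal sketch on a toy NumPy MLP. Everything in it is an assumption made for illustration: the function names `forward`, `unit_scores`, and `prune_to_budget`, the mean-squared feature error used as the score, the greedy selection, and the multiply-accumulate cost estimate are hypothetical and are not taken from the paper, which targets transformer components rather than MLP units.

```python
import numpy as np

def forward(x, weights, masks):
    """Run a toy ReLU MLP, zeroing masked units; return every layer's features."""
    feats, h = [], x
    for w, m in zip(weights, masks):
        h = np.maximum(h @ w, 0.0) * m
        feats.append(h)
    return feats

def unit_scores(x, weights):
    """Score each hidden unit by the feature error its removal causes downstream."""
    masks = [np.ones(w.shape[1]) for w in weights]
    ref = forward(x, weights, masks)
    scores = {}
    for layer in range(len(weights) - 1):  # the output layer is never pruned
        for unit in range(weights[layer].shape[1]):
            masks[layer][unit] = 0.0
            out = forward(x, weights, masks)
            # Accumulate error over the masked layer and all later layers.
            scores[(layer, unit)] = sum(
                float(np.mean((a - b) ** 2))
                for a, b in zip(ref[layer:], out[layer:])
            )
            masks[layer][unit] = 1.0  # restore before scoring the next unit
    return scores

def prune_to_budget(weights, scores, keep_ratio):
    """Greedily mask the lowest-scoring units until the cost estimate fits the budget."""
    masks = [np.ones(w.shape[1]) for w in weights]

    def macs(current):
        # Multiply-accumulate count using only the active units of each layer.
        total, fan_in = 0, weights[0].shape[0]
        for m in current:
            fan_out = int(m.sum())
            total += fan_in * fan_out
            fan_in = fan_out
        return total

    budget = keep_ratio * macs(masks)
    for (layer, unit), _ in sorted(scores.items(), key=lambda kv: kv[1]):
        if macs(masks) <= budget:
            break
        if masks[layer].sum() > 1:  # keep at least one unit per layer
            masks[layer][unit] = 0.0
    return masks, macs(masks)

# Toy usage: random weights stand in for a pre-trained model.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 16)), rng.normal(size=(16, 4))]
x = rng.normal(size=(32, 8))  # small calibration batch
scores = unit_scores(x, weights)
masks, used = prune_to_budget(weights, scores, keep_ratio=0.6)
print("kept units per layer:", [int(m.sum()) for m in masks], "MACs:", used)
```

The weights are never updated in this sketch; only masks are chosen, which mirrors the one-shot, no-re-training setting in spirit. A real implementation would score transformer structures such as attention heads or feed-forward neurons and would use the paper's own importance measure and search procedure.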