NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models

Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, Se-Young Yun

2023-10-16

Abstract: Structured pruning methods have proven effective in reducing model size and accelerating inference in various network architectures such as Transformers. Despite the versatility of encoder-decoder models in numerous NLP tasks, structured pruning methods for such models remain less explored than those for encoder-only models. In this study, we investigate the behavior of structured pruning in encoder-decoder models from a decoupled perspective, pruning the encoder and decoder components separately. Our findings highlight two insights: (1) the number of decoder layers is the dominant factor in inference speed, and (2) low sparsity in the pruned encoder network enhances generation quality. Motivated by these findings, we propose a simple and effective framework, NASH, that narrows the encoder and shortens the decoder networks of encoder-decoder models. Extensive experiments on diverse generation and inference tasks validate the effectiveness of our method in both speedup and output quality.
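One way to see why finding (1) is plausible: in autoregressive generation the encoder runs once over the input, while the decoder stack runs once per generated token. The sketch below is a toy cost model, not the paper's measurement protocol. It only counts sequential layer invocations as a rough latency proxy, and the function name `layer_calls` and the layer and token counts are illustrative assumptions.

```python
def layer_calls(enc_layers: int, dec_layers: int, output_tokens: int) -> int:
    """Count sequential layer invocations for one autoregressive generation.

    Assumption: the encoder stack runs once over the whole input, and the
    decoder stack runs once per generated token. Per-layer cost, caching,
    and hardware effects are ignored.
    """
    return enc_layers + dec_layers * output_tokens


# Hypothetical 12+12 layer model generating 64 tokens.
base = layer_calls(12, 12, 64)
half_encoder = layer_calls(6, 12, 64)   # removing encoder layers barely helps
half_decoder = layer_calls(12, 6, 64)   # removing decoder layers nearly halves the count

print(base, half_encoder, half_decoder)
```

Under this proxy, removing encoder layers changes the total only marginally, which is consistent with narrowing the encoder (to limit quality loss) rather than relying on it for speedup.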

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

1. **Application of Structured Pruning Methods in Encoder-Decoder Models**: Although structured pruning methods have proven effective in reducing model size and accelerating inference, research on applying them to encoder-decoder models is relatively scarce. The paper proposes a new framework, NASH (Narrow Encoder and Shallow Decoder), designed specifically for structured pruning of encoder-decoder models.
2. **Improving Inference Speed and Output Quality**: Experiments show that the number of decoder layers is the main factor affecting inference speed, while the sparsity of the encoder network is crucial for output quality. Based on these findings, NASH speeds up inference by reducing the number of decoder layers and protects generation quality by keeping encoder sparsity low.
3. **Unified Acceleration Framework**: NASH is designed as a general framework applicable to various natural language processing tasks, including summarization and question answering. Experimental results show that NASH achieves significant speedups across multiple tasks while maintaining high output quality.
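The two design choices summarized above, a shallow decoder and a narrow encoder, can be illustrated with a minimal sketch. This is not the authors' implementation. The evenly spaced layer selection, the score-based unit mask, and the names `shallow_decoder` and `narrow_encoder_mask` are all assumptions made for illustration, and the importance scores are made-up numbers.

```python
def shallow_decoder(num_layers: int, keep: int) -> list[int]:
    """Pick `keep` decoder layer indices spread evenly across the stack.

    Assumption: even spacing is one simple selection rule; this page does
    not state which rule the paper uses.
    """
    if not 1 <= keep <= num_layers:
        raise ValueError("keep must be between 1 and num_layers")
    if keep == 1:
        return [0]
    # Integer arithmetic keeps the indices distinct and in range.
    return [(i * (num_layers - 1)) // (keep - 1) for i in range(keep)]


def narrow_encoder_mask(scores: list[float], sparsity: float) -> list[int]:
    """Return a 0/1 keep-mask that drops the lowest-scoring units.

    `scores` are hypothetical per-unit importance values (e.g. for heads or
    hidden units); `sparsity` is the fraction of units to remove.
    """
    n_drop = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [0 if i in dropped else 1 for i in range(len(scores))]


# Keep 3 of 12 decoder layers, and drop 25% of four encoder units.
print(shallow_decoder(12, 3))
print(narrow_encoder_mask([0.9, 0.1, 0.5, 0.7], sparsity=0.25))
```

The asymmetry is deliberate: the decoder loses whole layers, since depth is what drives latency, while the encoder keeps its depth and only loses a small fraction of units.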


Structured Probabilistic Pruning for Convolutional Neural Network Acceleration

Joint Structured Pruning and Dense Knowledge Distillation for Efficient Transformer Model Compression

Structured Pruning Learns Compact and Accurate Models

Dynamic Structure Pruning for Compressing CNNs

A Dynamic Pruning Method on Multiple Sparse Structures in Deep Neural Networks

Structured Pruning for Efficient Convolutional Neural Networks Via Incremental Regularization

StructADMM: A Systematic, High-Efficiency Framework of Structured Weight Pruning for DNNs

Adaptive Activation-based Structured Pruning

SS-Auto: A Single-Shot, Automatic Structured Weight Pruning Framework of DNNs with Ultra-High Efficiency

AACP: Model Compression by Accurate and Automatic Channel Pruning

Structurally Prune Anything: Any Architecture, Any Framework, Any Time

Accurate and Structured Pruning for Efficient Automatic Speech Recognition

Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression

Structural Pruning of Pre-trained Language Models via Neural Architecture Search

Structured Pruning of Recurrent Neural Networks through Neuron Selection

Layer-adaptive Structured Pruning Guided by Latency

Structured Term Pruning for Computational Efficient Neural Networks Inference

Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

RL-Pruner: Structured Pruning Using Reinforcement Learning for CNN Compression and Acceleration

AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates