Abstract: Transformers have emerged as the leading architecture in deep learning, proving to be versatile and highly effective across diverse domains beyond language and image processing. However, their impressive performance often incurs high computational costs due to their substantial model size. This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning to improve model efficiency while preserving performance for both language and image generation tasks. Specifically, we propose a training-free pruning method that uses Newton's method to calculate a numerical score for the Attention and MLP modules, respectively. We further propose a compensation algorithm that recovers the performance of the pruned model. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments. Our experiments show that our method achieves state-of-the-art performance with reduced memory usage and faster generation speeds on GPUs.

What problem does this paper attempt to address?

The paper aims to improve the computational efficiency of Transformer-based autoregressive models (especially decoder-only architectures) in language generation and image generation tasks while maintaining model performance. Specifically, it focuses on the following points:

1. **The problem of high computational cost**: Although Transformer models perform well on a variety of tasks, their large size leads to high computational costs. An effective method is therefore needed to compress these models and reduce their consumption of computational resources.
2. **The application of structured pruning**: Compared with unstructured pruning, structured pruning reduces computational and memory overhead more effectively. However, most existing compression methods focus on language models and are of limited effectiveness for image generation tasks, because language and image processing differ fundamentally in data structures and computational requirements.
3. **Model recovery after pruning**: Restoring model performance after pruning is an important issue. Fully retraining large autoregressive models is usually computationally infeasible, so a lightweight compensation algorithm is needed that adjusts the remaining weights to make up for the loss caused by the pruned weights.

To address these problems, the paper proposes a new structured pruning method that combines numerical scoring with compensation, with the following goals:

- **Propose a numerical scoring method**: Calculate numerical scores for each layer by solving for the optimal pruning mask with Newton's method so as to minimize the pruning error.
- **Introduce a compensation algorithm**: Compensate for the performance loss caused by pruning by updating the remaining weights, further improving the task performance of the pruned model.
- **Verify the effectiveness of the method**: Through theoretical support and extensive experiments, the paper reports that the method achieves state-of-the-art performance in both language generation and image generation tasks while reducing GPU memory usage and accelerating generation.

In summary, the paper aims to improve the efficiency of Transformer models across different tasks, while preserving their performance, through pruning and compensation strategies.
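
The score-then-compensate idea described above can be illustrated with a small NumPy sketch on a single linear layer. This is not the paper's algorithm: the summary does not give the Newton-based score or the compensation formulas, so the score below is a simple stand-in (squared activation norm times squared weight norm per input channel) and the compensation is an ordinary least-squares refit of the kept weights on calibration data. All names (`channel_scores`, `prune_and_compensate`, `keep_ratio`) and the random calibration data are hypothetical.

```python
import numpy as np


def channel_scores(X, W):
    """Stand-in importance score per input channel (NOT the paper's Newton-based score).

    X: calibration activations, shape (n_samples, d_in)
    W: layer weights, shape (d_in, d_out), so the layer computes X @ W
    """
    return (X ** 2).sum(axis=0) * (W ** 2).sum(axis=1)


def prune_and_compensate(X, W, keep_ratio=0.5):
    """Structurally prune input channels of W, then refit the kept weights."""
    n_keep = max(1, int(round(keep_ratio * W.shape[0])))
    scores = channel_scores(X, W)
    # Keep the highest-scoring channels; whole rows of W are removed (structured pruning).
    keep = np.sort(np.argsort(scores)[-n_keep:])
    W_pruned = W[keep]
    # Assumed compensation step: least-squares refit so that
    # X[:, keep] @ W_comp approximates the dense output X @ W.
    W_comp, *_ = np.linalg.lstsq(X[:, keep], X @ W, rcond=None)
    return keep, W_pruned, W_comp


def rel_error(X, W, keep, W_new):
    """Relative Frobenius error of the pruned layer against the dense layer."""
    Y = X @ W
    return np.linalg.norm(Y - X[:, keep] @ W_new) / np.linalg.norm(Y)


rng = np.random.default_rng(0)
# Correlated calibration activations, so kept channels can partly stand in for removed ones.
X = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 32))
X += 0.1 * rng.standard_normal((256, 32))
W = rng.standard_normal((32, 16))

keep, W_pruned, W_comp = prune_and_compensate(X, W, keep_ratio=0.5)
err_pruned = rel_error(X, W, keep, W_pruned)
err_comp = rel_error(X, W, keep, W_comp)
print(f"relative error without compensation: {err_pruned:.4f}")
print(f"relative error with compensation:    {err_comp:.4f}")
```

Because the uncompensated weights `W[keep]` are one feasible solution of the least-squares problem, the refit can only match or lower the reconstruction error on the calibration data. Whether that carries over to held-out data or to end-task quality is a separate question that this sketch does not address.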

Numerical Pruning for Efficient Autoregressive Models
