Numerical Pruning for Efficient Autoregressive Models

Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Jing Liu, Ruiyi Zhang, Ryan A. Rossi, Hao Tan, Tong Yu, Xiang Chen, Yufan Zhou, Tong Sun, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
2024-12-17
Abstract: Transformers have emerged as the leading architecture in deep learning, proving to be versatile and highly effective across diverse domains beyond language and image processing. However, their impressive performance often incurs high computational costs due to their substantial model size. This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning to improve the model efficiency while preserving performance for both language and image generation tasks. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. Besides, we further propose another compensation algorithm to recover the pruned model for better performance. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments. Our experiments show that our method achieves state-of-the-art performance with reduced memory usage and faster generation speeds on GPUs.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how to improve the computational efficiency of Transformer-based autoregressive models (especially decoder-only architectures) in language generation and image generation tasks while maintaining model performance. Specifically, the paper focuses on the following points:

1. **The problem of high computational cost**: Although Transformer models perform well across a variety of tasks, their large size leads to high computational costs. An effective way to compress these models is therefore needed to reduce their consumption of computational resources.
2. **The application of structured pruning**: Compared with unstructured pruning, structured pruning reduces computational and memory overhead more effectively. However, most existing compression methods focus on language models and are of limited effectiveness for image generation, because language and image processing differ fundamentally in data structure and computational requirements.
3. **Model recovery after pruning**: Restoring model performance after pruning is an important issue. Fully retraining large autoregressive models is usually computationally infeasible, so a lightweight compensation algorithm is needed that adjusts the remaining weights to make up for the loss caused by the pruned weights.

To solve these problems, the paper proposes a new structured pruning method that combines numerical scoring with compensation, aiming to achieve the following goals:

- **Propose a numerical scoring method**: Calculate numerical scores for each layer by solving for the optimal pruning mask with Newton's method, so as to minimize the pruning error.
- **Introduce a compensation algorithm**: Compensate for the performance loss caused by pruning by updating the remaining weights, further improving the task performance of the pruned model.
- **Verify the effectiveness of the method**: Through theoretical support and extensive experiments, the paper shows that the method achieves state-of-the-art performance in both language generation and image generation tasks while reducing GPU memory usage and accelerating generation.

In summary, this paper aims to improve the efficiency of Transformer models across different tasks, while preserving their performance, through its pruning and compensation strategies.
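The idea of pruning whole channels and then adjusting the remaining weights can be illustrated with a small sketch. This is not the paper's algorithm: the importance score below is a simple norm-based proxy standing in for the Newton's-method score, and the compensation step is an ordinary least-squares refit on calibration data. The function name `prune_and_compensate`, the `keep_ratio` parameter, and the random calibration data are all assumptions made for illustration.

```python
import numpy as np

def prune_and_compensate(W, X, keep_ratio=0.5):
    """Structured pruning of input channels of a linear layer, with refit.

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration inputs.
    Returns kept channel indices, compensated weights, and both errors.
    """
    n_in = W.shape[1]
    k = max(1, int(round(keep_ratio * n_in)))

    # Proxy score per input channel (assumption, NOT the paper's score):
    # input energy times weight-column energy.
    scores = (X ** 2).sum(axis=0) * (W ** 2).sum(axis=0)
    keep = np.sort(np.argsort(scores)[-k:])

    Y = X @ W.T            # dense layer output we want to reproduce
    X_keep = X[:, keep]    # inputs restricted to the kept channels
    W_plain = W[:, keep]   # pruned weights without any compensation

    # Compensation: choose new weights minimizing ||X_keep @ W'.T - Y||_F.
    W_comp_T, *_ = np.linalg.lstsq(X_keep, Y, rcond=None)
    W_comp = W_comp_T.T

    err_plain = np.linalg.norm(X_keep @ W_plain.T - Y)
    err_comp = np.linalg.norm(X_keep @ W_comp.T - Y)
    return keep, W_comp, err_plain, err_comp

# Hypothetical calibration data with correlated input channels.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 16)) @ rng.standard_normal((16, 16))
W = rng.standard_normal((8, 16))

keep, W_comp, err_plain, err_comp = prune_and_compensate(W, X, keep_ratio=0.5)
print("kept channels:", keep)
print("error without compensation:", err_plain)
print("error with compensation:   ", err_comp)
```

Because the uncompensated weights are one feasible choice in the least-squares problem, the refit error can never exceed the uncompensated error on the calibration data. How well that carries over to held-out data depends on how representative the calibration set is.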