Abstract: Large Language Models (LLMs), renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one-shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed as Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio of OWL is proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively, while delivering 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Codes are available at

What problem does this paper attempt to address?

The paper addresses the difficulty of deploying large language models (LLMs) in practice, which stems from their size. Although LLMs perform well across many applications, their size translates into heavy compute and resource requirements, raising concerns about financial cost and environmental impact. Researchers have therefore applied traditional network pruning to LLMs, aiming to cut parameter counts without sacrificing performance. Existing LLM pruning strategies, however, usually apply uniform layerwise sparsity, pruning every layer at the same ratio. This works well, yet it contrasts with vision models, where non-uniform layerwise sparsity usually gives better results. Investigating this disparity, the authors find that it correlates strongly with the emergence of activation outliers in LLMs, that is, features whose values are significantly larger than those of other features. Based on this finding, they propose a new LLM pruning method, Outlier Weighed Layerwise sparsity (OWL). Its core idea is to set each layer's sparsity ratio according to the proportion of outliers observed in that layer, so that layerwise weight sparsity is better aligned with the outlier ratios. Empirical evaluation across the LLaMA-V1 family and OPT on multiple benchmarks shows OWL's advantages over existing methods. In particular, at a high sparsity level of 70%, OWL surpasses Wanda and SparseGPT by 61.22 and 6.80 perplexity, respectively, and delivers a 2.6x end-to-end inference speed-up in the DeepSparse inference engine.
These results support the importance of non-uniform layerwise sparsity for LLM pruning and offer a new perspective for future research.
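
The idea of deriving per-layer sparsity from per-layer outlier ratios can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. The function names `outlier_ratio` and `layerwise_sparsity` and the parameters `m`, `target` and `lam` are hypothetical. The sketch assumes that each weight is scored by its magnitude times the norm of its input feature, that an outlier is a score above `m` times the layer's mean score, and that layers with a larger share of outliers are pruned less. The summary above only states that sparsity is set according to the outlier proportion, so the direction of the mapping is an assumption here.

```python
import numpy as np


def outlier_ratio(weight, input_norm, m=5.0):
    """Share of entries whose score exceeds m times the layer's mean score.

    Assumed score: |W_ij| * ||X_j||, with weight of shape (out, in) and
    input_norm of shape (in,). Both the score and m are assumptions.
    """
    score = np.abs(weight) * input_norm[None, :]
    return float((score > m * score.mean()).mean())


def layerwise_sparsity(ratios, target=0.7, lam=0.08):
    """Map per-layer outlier ratios to per-layer sparsity ratios.

    The result averages to `target` and spans at most 2 * lam across
    layers. Assumed direction: more outliers means lower sparsity.
    Values are not clipped to [0, 1]; choose target and lam accordingly.
    """
    ratios = np.asarray(ratios, dtype=float)
    spread = ratios.max() - ratios.min()
    if spread == 0:
        # All layers look alike, so fall back to uniform sparsity.
        return np.full_like(ratios, target)
    # Rescale outlier ratios to the range [0, 2 * lam].
    scaled = (ratios - ratios.min()) / spread * 2 * lam
    # Flip so that outlier-heavy layers end up with lower sparsity.
    sparsity = 1.0 - scaled
    # Shift so that the mean over layers equals the target sparsity.
    return sparsity + (target - sparsity.mean())


if __name__ == "__main__":
    # Toy outlier ratios for three layers (made-up values).
    print(layerwise_sparsity([0.01, 0.03, 0.05], target=0.7, lam=0.08))
```

With these toy inputs, the layer with the largest outlier ratio receives the lowest sparsity while the average over the three layers stays at the 70% target. Setting `lam` to zero is not meaningful in this sketch; uniform pruning corresponds to giving every layer the `target` value directly.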

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

A Simple and Effective Pruning Approach for Large Language Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

WRP: Weight Recover Prune for Structured Sparsity

MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

Pruning Foundation Models for High Accuracy without Retraining

Reassessing Layer Pruning in LLMs: New Insights and Methods

Learn To be Efficient: Build Structured Sparsity in Large Language Models

Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration

Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information

Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models

SlimGPT: Layer-wise Structured Pruning for Large Language Models

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models