Abstract: In contrast to pruning moderate-size neural networks, structural weight pruning of Large Language Models (LLMs) poses a new challenge for the efficiency of pruning algorithms, owing to the heavy computation and memory demands of LLMs. Recent efficient LLM pruning methods typically operate in the post-training phase without expensive weight fine-tuning; however, their pruning criteria often rely on heuristic, hand-crafted metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning method that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method eliminates back-propagation through the LLM itself during optimization, requiring only forward passes of the LLM. We achieve this by learning an underlying Bernoulli distribution from which binary pruning masks are sampled, and by decoupling the Bernoulli parameters from the LLM loss, which enables efficient optimization via a policy gradient estimator without back-propagation. As a result, our method can 1) operate at the structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., it automatically determines different redundancy for different layers), and 3) optionally initialize its Bernoulli distributions with a metric-based method. Extensive experiments on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral using the C4 and WikiText2 datasets demonstrate that our method runs in 2.7 hours with around 35GB of memory for the 13B models on a single A100 GPU, and that our pruned models outperform state-of-the-art methods in perplexity and on the majority of zero-shot tasks. Code will be released.
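The forward-only idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic score-function (REINFORCE-style) estimator over Bernoulli keep-probabilities, applied to a hypothetical four-channel toy loss. The names `policy_gradient_step`, `run_toy_example`, and `toy_loss`, the batch-mean baseline, the per-channel keep cost standing in for a sparsity constraint, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np


def policy_gradient_step(probs, loss_fn, rng, num_samples=32, lr=0.02, eps=0.05):
    """One update of Bernoulli keep-probabilities using only loss evaluations."""
    # Sample binary masks m ~ Bernoulli(probs); 1 keeps a structure, 0 prunes it.
    masks = (rng.random((num_samples, probs.size)) < probs).astype(np.float64)
    # Forward-only evaluations: no gradient of loss_fn is ever requested.
    losses = np.array([loss_fn(m) for m in masks])
    # Batch-mean baseline for variance reduction (an assumption of this sketch).
    baseline = losses.mean()
    # Score function: d/dp log Bernoulli(m; p) = (m - p) / (p * (1 - p)).
    score = (masks - probs) / (probs * (1.0 - probs))
    grad = ((losses - baseline)[:, None] * score).mean(axis=0)
    # Gradient descent on the expected loss, keeping probabilities away from 0 and 1.
    return np.clip(probs - lr * grad, eps, 1.0 - eps)


def run_toy_example(steps=300, seed=0):
    """Learn which of four hypothetical channels to keep (toy stand-in for an LLM)."""
    importance = np.array([5.0, 4.0, 0.1, 0.1])  # hypothetical cost of pruning each channel
    keep_cost = 1.0  # hypothetical penalty per kept channel (stand-in for a sparsity constraint)

    def toy_loss(mask):
        # Loss paid for every pruned channel plus a cost for every kept channel.
        return float(np.sum(importance * (1.0 - mask)) + keep_cost * np.sum(mask))

    rng = np.random.default_rng(seed)
    probs = np.full(importance.size, 0.5)  # uninformed initialization
    for _ in range(steps):
        probs = policy_gradient_step(probs, toy_loss, rng)
    return probs
```

Calling `run_toy_example()` should drive the keep-probabilities of the two important channels toward the upper clip and those of the two redundant channels toward the lower clip. In a real setting `loss_fn` would be a forward pass of the masked LLM on calibration data, and the uniform initialization could be replaced by probabilities derived from a metric-based pruning score, as the abstract notes.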

SparseWAV: Fast and Accurate One-Shot Unstructured Pruning for Large Speech Foundation Models

Task-Agnostic Structured Pruning of Speech Representation Models

Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

Accurate and Structured Pruning for Efficient Automatic Speech Recognition

Convexity-based Pruning of Speech Representation Models

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration

On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

Pruning Foundation Models for High Accuracy without Retraining

USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

Structured Pruning of Large Language Models

Model Compression by Iterative Pruning with Knowledge Distillation and Its Application to Speech Enhancement

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

A Simple and Effective Pruning Approach for Large Language Models

Keyword-Specific Acoustic Model Pruning for Open-Vocabulary Keyword Spotting

Fluctuation-based Adaptive Structured Pruning for Large Language Models

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

An Efficient Layer-Wised Beam Pruning Algorithm for Large Vocabulary Continuous Speech Recognition System