Abstract: The ever-growing scale of large language models (LLMs), though opening a potential path toward artificial general intelligence, poses a daunting obstacle to their on-device deployment. As one of the most well-established pre-LLM approaches to reducing model complexity, network pruning appears to lag behind in the era of LLMs, due mostly to its need for costly fine-tuning (or re-training) given the massive volumes of model parameters and training data. To close this industry-academia gap, we introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach that slightly updates sparse LLMs without expensive backpropagation or any weight updates. Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs by performing iterative weight pruning-and-growing on top of sparse LLMs. To this end, DSnoT takes into account the anticipated reduction in reconstruction error for pruning and growing, as well as the variance w.r.t. different input data when growing each weight. This procedure can be executed efficiently in linear time since it obviates the need for backpropagation when fine-tuning LLMs. Extensive experiments on LLaMA-V1/V2, Vicuna, and OPT across various benchmarks demonstrate the effectiveness of DSnoT in enhancing the performance of sparse LLMs, especially at high sparsity levels. For instance, DSnoT outperforms the state-of-the-art Wanda by 26.79 perplexity at 70% sparsity with LLaMA-7B. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and opens new avenues to scale the great potential of sparsity to LLMs. Code is available at https://github.com/zyxxmu/DSnoT.
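The pruning-and-growing idea in the abstract can be illustrated with a small sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: it treats a single output neuron, measures reconstruction error as the expected output gap between the dense and the sparse neuron over some calibration inputs, and greedily swaps one pruned weight for one kept weight while that gap shrinks. The variance term mentioned in the abstract is omitted, and the names `expected_error`, `prune_and_grow`, `X`, and `max_iters` are hypothetical.

```python
import numpy as np


def expected_error(w, mask, mu):
    """Expected output gap between the dense and the sparse neuron."""
    return float(np.sum(w * mu * ~mask))


def prune_and_grow(w, mask, X, max_iters=50):
    """Greedy mask update for one output neuron; weight values never change.

    w    : (d,) dense weights
    mask : (d,) boolean array, True = weight kept
    X    : (n, d) calibration inputs (hypothetical data)
    """
    mask = mask.copy()
    mu = X.mean(axis=0)
    contrib = w * mu  # expected contribution of each weight to the output
    for _ in range(max_iters):
        eps = float(np.sum(contrib[~mask]))  # current expected error
        pruned = np.flatnonzero(~mask)
        kept = np.flatnonzero(mask)
        if pruned.size == 0 or kept.size == 0:
            break
        # Grow: revive the pruned weight that leaves the smallest |error|.
        i = pruned[np.argmin(np.abs(eps - contrib[pruned]))]
        eps_grown = eps - contrib[i]
        # Prune: drop the kept weight that leaves the smallest |error|.
        j = kept[np.argmin(np.abs(eps_grown + contrib[kept]))]
        eps_new = eps_grown + contrib[j]
        if abs(eps_new) >= abs(eps):
            break  # no swap improves the error; stop early
        mask[i], mask[j] = True, False  # one grown, one pruned: sparsity is kept
    return mask
```

Each iteration costs time linear in the number of weights and needs no gradients, which is the property the abstract emphasizes. The real method's selection criteria may differ from this greedy rule.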

Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR Via Supernet

Deep Neural Network Acceleration with Sparse Prediction Layers

SUBP: Soft Uniform Block Pruning for 1×N Sparse CNNs Multithreading Acceleration

Learning a Dual-Mode Speech Recognition Model via Self-Pruning

TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

Systolic-Array Deep-Learning Acceleration Exploring Pattern-Indexed Coordinate-Assisted Sparsity for Real-Time On-Device Speech Processing

Extremely Low Footprint End-to-End ASR System for Smart Device

Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet

A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency

DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement

Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs

Weight-sharing Supernet for Searching Specialized Acoustic Event Classification Networks Across Device Constraints

SparseVSR: Lightweight and Noise Robust Visual Speech Recognition

Post-Training Sparse Attention with Double Sparsity

Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition

SparseByteNN: A Novel Mobile Inference Acceleration Framework Based on Fine-Grained Group Sparsity

Sparsity Meets Robustness: Channel Pruning for the Feynman-Kac Formalism Principled Robust Deep Neural Nets

Weight-importance sparse training in keyword spotting

An efficient pruning and fine-tuning method for deep spiking neural network