Abstract: Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53$\times$ and 1.8$\times$ at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.

What problem does this paper attempt to address?

The paper addresses how to obtain activation sparsity in large language models (LLMs) without any training, in order to speed up inference. It proposes TEAL (Training-Free Activation Sparsity in LLMs), which sparsifies the model's hidden states through magnitude-based pruning and thereby reduces the computation and memory movement required by matrix multiplications. Existing methods either target older models with ReLU-based sparsity or require continued pre-training on large amounts of data, which limits their adoption. TEAL aims to overcome both limitations with a simple method that needs no additional training and achieves sparsity across the whole model while maintaining model performance.

### Main problems

1. **Limitations of existing methods**:
   - **Targeting older models**: Some methods are designed for older models that use the ReLU activation function, while most modern LLMs use other activation functions (such as SwiGLU).
   - **Requiring extensive continued pre-training**: Other methods require continued pre-training on up to hundreds of billions of tokens, which is impractical in many applications.
2. **Improving inference efficiency**:
   - **Reducing computation and memory movement**: Activation sparsity reduces the computation and memory movement required by matrix multiplications, which accelerates inference.
   - **Compatibility with quantization techniques**: TEAL is compatible with weight quantization, which yields further efficiency gains.

### Solutions

TEAL achieves these goals through the following steps:

1. **Magnitude-based pruning**:
   - Define a pruning threshold $t_p$ such that $\frac{1}{n} \sum_{i = 1}^n P(|\tilde{x}_i| \leq t_p) = p$, where $p$ is the sparsity level and $\tilde{x}_i$ is the $i$-th entry of the activation vector, treated as a random variable.
   - For each activation value $x_i$, if $|x_i| \leq t_p$, set it to 0; otherwise, keep it unchanged.
2. **Block-level greedy optimization**:
   - A block-level greedy algorithm gradually increases the sparsity of individual layers until the target sparsity level is reached.
   - At each step, the layer whose sparsification has the least impact on model performance is selected.
3. **Hardware-aware acceleration**:
   - Develop a specialized sparse GEMV kernel that optimizes the memory access pattern and avoids unnecessary memory transfers.
   - Store the weight matrix in column-major order so that memory accesses coalesce well.

### Experimental results

- **Performance evaluation**:
  - Experiments on the Llama-2, Llama-3, and Mistral model families achieve 40-50% model-wide sparsity with minimal performance degradation.
  - At 40% and 50% model-wide sparsity, wall-clock decoding is up to 1.53$\times$ and 1.8$\times$ faster, respectively.
- **Compatibility with quantization**:
  - TEAL is compatible with 8-bit, 4-bit, and 2/3-bit quantization, which further improves inference efficiency.

### Conclusion

TEAL is a simple, training-free method that achieves activation sparsity in modern LLMs and speeds up decoding while largely preserving model performance. This makes it relevant to large-scale deployment and to resource-constrained settings.
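The thresholding rule and the reason it saves memory traffic can be illustrated with a small sketch. This is not the paper's GPU kernel: it uses plain Python lists, and the function names `calibrate_threshold`, `sparsify`, and `sparse_gemv` are hypothetical. It also assumes the threshold $t_p$ is estimated as an empirical quantile of activation magnitudes from calibration samples, which is one straightforward way to satisfy the definition above.

```python
def calibrate_threshold(samples, p):
    """Estimate t_p: roughly a fraction p of |samples| lie at or below it.

    Assumption: an empirical quantile over calibration activations stands in
    for the probabilistic definition of the threshold.
    """
    mags = sorted(abs(v) for v in samples)
    k = int(p * len(mags))  # number of entries that should be pruned
    return mags[k - 1] if k > 0 else 0.0


def sparsify(x, t):
    """Zero every activation whose magnitude is at or below the threshold."""
    return [0.0 if abs(v) <= t else v for v in x]


def sparse_gemv(w_cols, x):
    """Compute y = W x, skipping the columns of W that meet a zero in x.

    w_cols[j] holds column j of W, mirroring column-major storage: a zero
    activation x[j] means that whole contiguous column is never read.
    """
    y = [0.0] * len(w_cols[0])
    for xj, col in zip(x, w_cols):
        if xj == 0.0:
            continue  # skipped column: no loads, no multiply-adds
        for i, wij in enumerate(col):
            y[i] += wij * xj
    return y


# Tiny deterministic example: 4 input activations, a 2 x 4 weight matrix.
calibration = [1.0, -2.0, 3.0, -4.0]
t = calibrate_threshold(calibration, p=0.5)       # 2.0
x_sparse = sparsify([1.0, -2.0, 3.0, -4.0], t)    # [0.0, 0.0, 3.0, -4.0]
w_cols = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y = sparse_gemv(w_cols, x_sparse)                 # [-13.0, -14.0]
print(t, x_sparse, y)
```

At 50% sparsity, half of the weight columns in this example are never touched. Since decoding is typically limited by how fast weights can be read from memory, skipping those loads is what a sparse GEMV kernel would turn into wall-clock savings.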

Training-Free Activation Sparsity in Large Language Models

Learn To be Efficient: Build Structured Sparsity in Large Language Models

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Sparsity-Accelerated Training for Large Language Models

Activation Sparsity Opportunities for Compressing General Large Language Models

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models

SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models

ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs

ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

Achieving Sparse Activation in Small Language Models

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Post-Training Sparse Attention with Double Sparsity

SparseLLM: Towards Global Pruning for Pre-trained Language Models

A Theoretical Explanation of Activation Sparsity Through Flat Minima and Adversarial Robustness