Abstract: Sparse computation offers a compelling solution for the inference of Large Language Models (LLMs) in low-resource scenarios by dynamically skipping the computation of inactive neurons. While traditional approaches focus on ReLU-based LLMs, leveraging zeros in activation values, we broaden the scope of sparse LLMs beyond zero activation values. We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold, demonstrating that non-ReLU LLMs also exhibit sparse activation. To find the most efficient activation function for sparse computation, we propose a systematic framework to examine the sparsity of LLMs from three aspects: the trade-off between sparsity and performance, the predictivity of sparsity, and the hardware affinity. We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$. The results indicate that models employing ReLU$^2$ excel across all three evaluation aspects, highlighting its potential as an efficient activation function for sparse LLMs. We will release the code to facilitate future research.

What problem does this paper attempt to address?

The paper asks how sparse computation can improve the inference efficiency of large language models (LLMs) when resources are limited. Specifically, it looks beyond the traditional ReLU activation function for activation functions that yield a higher proportion of sparse activation, thereby improving the deployment and inference performance of LLMs in low-resource environments.

### Problem Background

Although large language models (LLMs) show great potential in deep learning, their inference requires a large amount of compute and storage, which makes them difficult to deploy in resource-constrained environments. Sparse computation is a promising way to address this challenge: it reduces resource consumption by dynamically skipping the computation of inactive neurons.

### Main Contributions of the Paper

1. **Expand the Definition of Sparse Activation**:
   - Traditional methods only treat neurons with exactly zero activation values as inactive. This paper proposes a definition based on the output magnitude of each neuron and introduces a magnitude threshold to decide whether a neuron is activated.
2. **Systematic Framework for Evaluating Sparse Computation**:
   - The paper proposes a framework that evaluates the efficiency of sparse computation from three aspects: the trade-off between sparsity and performance, the predictability of sparsity, and hardware affinity.
3. **Experimental Verification**:
   - Experiments on different activation functions (ReLU, SwiGLU, ReGLU, and ReLU$^2$) show that ReLU$^2$ performs best in all three evaluation aspects, indicating its potential as an efficient activation function for sparse LLMs.
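The magnitude-based definition of activation described above can be illustrated with a small NumPy sketch. It assumes a plain (non-gated) feed-forward layer in which neuron $i$ contributes the vector $n_i(x) = \sigma(w^{in}_i \cdot x)\, w^{out}_i$; the function names, the weight layout, and the toy numbers are hypothetical and chosen only for illustration, not taken from the paper's code.

```python
import numpy as np

def relu2(z):
    """ReLU squared: max(z, 0) ** 2 (applied elementwise)."""
    return np.maximum(z, 0.0) ** 2

def neuron_output_norms(x, w_in, w_out, act=relu2):
    """L2 norm of each neuron's output vector.

    Assumed layout (illustrative): w_in is (d_ff, d_model), w_out is
    (d_ff, d_model), and neuron i outputs act(w_in[i] @ x) * w_out[i].
    """
    a = act(w_in @ x)  # (d_ff,) intermediate activations
    # ||a_i * w_out_i||_2 = |a_i| * ||w_out_i||_2
    return np.abs(a) * np.linalg.norm(w_out, axis=1)

def inactive_mask(norms, eps):
    """A neuron counts as inactive when its output magnitude is below eps."""
    return norms < eps

# Toy example with d_model = 2 and d_ff = 3.
x = np.array([2.0, 0.1])
w_in = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
w_out = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])

norms = neuron_output_norms(x, w_in, w_out)
zero_only = norms == 0.0                     # traditional "exactly zero" criterion
thresholded = inactive_mask(norms, eps=0.1)  # magnitude-threshold criterion
print(zero_only.sum(), thresholded.sum())
```

In this toy input the second neuron has a small but non-zero output, so the exact-zero criterion keeps it while the threshold criterion marks it inactive. That difference is what lets the definition extend to activation functions that rarely produce exact zeros.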
### Formulas and Key Concepts

- **Cumulative Errors of Tail Truncation (CETT)**:
  \[ \text{CETT}(x)=\frac{\left\|\sum_{i\in D}n_i(x)\right\|_2}{\left\|\text{FFN}(x)\right\|_2},\quad D = \{ i\mid\|n_i(x)\|_2 < \epsilon\} \]
  where $n_i(x)$ is the output of the $i$-th neuron, $\epsilon$ is a threshold, $D$ is the set of neurons with output magnitudes less than $\epsilon$, and $\|\cdot\|_2$ is the L2 norm. CETT is the relative error in the FFN output caused by dropping the neurons in $D$.
- **Sparsity Ratio**:
  \[ \text{Sparsity Ratio}=\frac{|D|}{d_{ff}} \]
  where $|D|$ is the number of neurons with output magnitudes less than $\epsilon$, and $d_{ff}$ is the total number of neurons.
- **Co-activation Frequency Matrix**:
  \[ M_{ij}=\frac{\text{co-activation times of the }i\text{-th and }j\text{-th neurons}}{\text{activation times of the }i\text{-th neuron}},\quad M_{ii} = 0 \]
  It measures the co-activation relationship between neurons.

### Conclusion

The paper shows that the ReLU$^2$ activation function performs well in sparse computation. It achieves high sparsity while maintaining performance, and it also offers high predictability and good hardware affinity. This makes it a very promising choice of activation function for sparse LLMs.
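The CETT, sparsity-ratio, and co-activation formulas can be sketched in NumPy as follows. This is a minimal illustration rather than the paper's implementation: it assumes the per-neuron output vectors $n_i(x)$ are already available as the rows of a matrix, so that $\text{FFN}(x)$ is their sum, and all function names are hypothetical.

```python
import numpy as np

def cett_and_sparsity(n, eps):
    """Return (CETT, sparsity ratio) for one input.

    n: (d_ff, d_model) array whose i-th row is the neuron output n_i(x);
       FFN(x) is assumed to be the sum of the rows.
    eps: magnitude threshold.
    Note: the result is undefined (nan or inf) if FFN(x) is the zero vector.
    """
    norms = np.linalg.norm(n, axis=1)  # ||n_i(x)||_2 per neuron
    d = norms < eps                    # boolean mask for the set D
    ffn = n.sum(axis=0)                # FFN(x)
    cett = np.linalg.norm(n[d].sum(axis=0)) / np.linalg.norm(ffn)
    return cett, d.mean()              # d.mean() equals |D| / d_ff

def coactivation_matrix(active):
    """Co-activation frequency matrix M from activation records.

    active: (num_tokens, d_ff) boolean array, True where a neuron is active.
    M[i, j] = (# tokens where i and j are both active) / (# tokens where i is active),
    with M[i, i] = 0. Rows of neurons that never activate are left at 0.
    """
    a = active.astype(float)
    co = a.T @ a                       # pairwise co-activation counts
    counts = a.sum(axis=0)[:, None]    # activation count of neuron i, as a column
    m = np.divide(co, counts, out=np.zeros_like(co), where=counts > 0)
    np.fill_diagonal(m, 0.0)
    return m

# Toy example: three neurons, two output dimensions.
n = np.array([[4.0, 0.0], [0.0, 0.01], [0.0, 0.0]])
cett, sparsity = cett_and_sparsity(n, eps=0.1)

active = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1]], dtype=bool)
M = coactivation_matrix(active)
```

Note that $M$ is not symmetric in general: each row is normalized by the activation count of its own neuron, so in the toy data `M[2, 1]` is 1.0 while `M[1, 2]` is 0.5.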

ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Achieving Sparse Activation in Small Language Models

SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference

Learn To be Efficient: Build Structured Sparsity in Large Language Models

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Activation Sparsity Opportunities for Compressing General Large Language Models

Training-Free Activation Sparsity in Large Language Models

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Sparsity-Accelerated Training for Large Language Models

Activation function optimization method: Learnable series linear units (LSLUs)

ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control

Learning Activation Functions for Sparse Neural Networks

CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information

Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models