ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs

Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, Maosong Sun
2024-02-06
Abstract: Sparse computation offers a compelling solution for the inference of Large Language Models (LLMs) in low-resource scenarios by dynamically skipping the computation of inactive neurons. While traditional approaches focus on ReLU-based LLMs, leveraging zeros in activation values, we broaden the scope of sparse LLMs beyond zero activation values. We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold, demonstrating that non-ReLU LLMs also exhibit sparse activation. To find the most efficient activation function for sparse computation, we propose a systematic framework to examine the sparsity of LLMs from three aspects: the trade-off between sparsity and performance, the predictivity of sparsity, and the hardware affinity. We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$. The results indicate that models employing ReLU$^2$ excel across all three evaluation aspects, highlighting its potential as an efficient activation function for sparse LLMs. We will release the code to facilitate future research.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to make the inference of large language models (LLMs) more efficient under limited resources through sparse computation. Specifically, it looks beyond the traditional ReLU activation function for activation functions that yield a higher proportion of sparse activation, thereby improving the deployment and inference performance of LLMs in low-resource environments.

### Problem Background

Although LLMs show great potential, their inference requires large amounts of compute and memory, which makes them difficult to deploy in resource-constrained environments. Sparse computation is a promising way to address this: it reduces resource consumption by dynamically skipping the computation of inactive neurons.

### Main Contributions of the Paper

1. **Expanded Definition of Sparse Activation**: Traditional methods treat only neurons with zero activation values as inactive. This paper instead defines activation by the output magnitude of each neuron and introduces a magnitude threshold that decides whether a neuron counts as activated.
2. **Systematic Framework for Evaluating Sparse Computation**: The paper proposes a framework that evaluates the efficiency of sparse computation from three aspects: the trade-off between sparsity and performance, the predictability of sparsity, and hardware affinity.
3. **Experimental Verification**: Experiments on LLMs with different activation functions (ReLU, SwiGLU, ReGLU, and ReLU$^2$) find that ReLU$^2$ performs best in all three evaluation aspects, showing its potential as an efficient activation function for sparse LLMs.

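
The magnitude-based definition of activation in contribution 1 can be illustrated with a toy feed-forward layer. The following is a minimal sketch, not the paper's released code: the function names (`relu2`, `neuron_outputs`, `active_neurons`, `ffn_output`), the plain two-matrix FFN layout, and the threshold value are all assumptions made for illustration. For clarity it computes every neuron's output first and masks afterwards; a real sparse-inference system would try to predict the inactive neurons and skip their computation altogether.

```python
import math

def relu2(z):
    """ReLU^2 activation: max(z, 0) squared."""
    return max(z, 0.0) ** 2

def neuron_outputs(x, w_in, w_out):
    """Per-neuron output vectors n_i(x) of a toy FFN (hypothetical layout).

    x:     input vector of length d_model
    w_in:  d_ff rows of length d_model (one input weight row per neuron)
    w_out: d_ff rows of length d_model (one output weight row per neuron)
    """
    outs = []
    for w_i, w_o in zip(w_in, w_out):
        a = relu2(sum(p * q for p, q in zip(w_i, x)))  # scalar activation value
        outs.append([a * v for v in w_o])              # neuron's contribution to the output
    return outs

def active_neurons(outs, eps):
    """Indices of neurons whose output L2 norm reaches the threshold eps."""
    return [i for i, n in enumerate(outs)
            if math.sqrt(sum(c * c for c in n)) >= eps]

def ffn_output(outs, keep):
    """Sum the output vectors of the neurons listed in `keep`."""
    d_model = len(outs[0])
    return [sum(outs[i][k] for i in keep) for k in range(d_model)]
```

With a nonzero threshold, a neuron whose activation is small but not exactly zero is also treated as inactive, which is how this definition extends sparsity beyond ReLU-style exact zeros.
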
### Formulas and Key Concepts

- **Cumulative Errors of Tail Truncation (CETT)**: the error introduced by dropping the neurons with the smallest output magnitudes, relative to the full FFN output:
\[ \text{CETT}(x)=\frac{\left\|\sum_{i\in D}n_i(x)\right\|_2}{\left\|\text{FFN}(x)\right\|_2},\quad D = \{ i\mid\|n_i(x)\|_2 < \epsilon\} \]
where \(n_i(x)\) is the output of the \(i\)-th neuron, \(\epsilon\) is the magnitude threshold, \(D\) is the set of neurons with output magnitudes below \(\epsilon\), and \(\|\cdot\|_2\) denotes the L2 norm.
- **Sparsity Ratio**:
\[ \text{Sparsity Ratio}=\frac{|D|}{d_{ff}} \]
where \(|D|\) is the number of neurons with output magnitudes below \(\epsilon\), and \(d_{ff}\) is the total number of neurons in the FFN layer.
- **Co-activation Frequency Matrix**:
\[ M_{ij}=\frac{\text{number of times neurons } i \text{ and } j \text{ are co-activated}}{\text{number of times neuron } i \text{ is activated}},\quad M_{ii} = 0 \]
It measures the co-activation relationship between neurons.

### Conclusion

The paper shows that the ReLU$^2$ activation function performs well in sparse computation: it achieves high sparsity while maintaining performance, and its sparsity is both highly predictable and hardware-friendly. This makes it a very promising choice of activation function for sparse LLMs.
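
The three quantities defined under "Formulas and Key Concepts" can be computed directly from per-neuron outputs and activation masks. The sketch below is only an illustration under stated assumptions: the helper names (`inactive_set`, `cett`, `sparsity_ratio`, `coactivation_matrix`) and the list-based data layout are hypothetical and not taken from the paper's code, and `cett` assumes the full FFN output is nonzero.

```python
import math

def l2(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))

def inactive_set(outs, eps):
    """D = {i : ||n_i(x)||_2 < eps}, where outs[i] is the output vector n_i(x)."""
    return [i for i, n in enumerate(outs) if l2(n) < eps]

def cett(outs, eps):
    """Norm of the summed outputs of neurons in D over the norm of the full FFN output."""
    d_model = len(outs[0])
    D = inactive_set(outs, eps)
    truncated = [sum(outs[i][k] for i in D) for k in range(d_model)]
    full = [sum(n[k] for n in outs) for k in range(d_model)]
    return l2(truncated) / l2(full)  # assumes a nonzero FFN output

def sparsity_ratio(outs, eps):
    """|D| / d_ff for a single input."""
    return len(inactive_set(outs, eps)) / len(outs)

def coactivation_matrix(masks):
    """M[i][j] = (#tokens where i and j are both active) / (#tokens where i is active).

    masks[t][i] is True when neuron i is active on token t. The diagonal is 0,
    and rows of neurons that are never active are left at 0.
    """
    d_ff = len(masks[0])
    counts = [sum(m[i] for m in masks) for i in range(d_ff)]
    M = [[0.0] * d_ff for _ in range(d_ff)]
    for i in range(d_ff):
        for j in range(d_ff):
            if i != j and counts[i] > 0:
                co = sum(1 for m in masks if m[i] and m[j])
                M[i][j] = co / counts[i]
    return M
```

Raising `eps` enlarges \(D\), so the sparsity ratio and CETT both grow; this is the sparsity-versus-error trade-off that a threshold has to balance. Note also that \(M\) is generally not symmetric, because each row is normalized by the activation count of its own neuron.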