Abstract: Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads. The emergence of activation sparsity in LLMs provides a natural approach to reduce this cost by involving only parts of the parameters for inference. However, existing methods only focus on utilizing this naturally formed activation sparsity in a post-training setting, overlooking the potential for further amplifying this inherent sparsity. In this paper, we hypothesize that LLMs can learn to be efficient by achieving more structured activation sparsity. To achieve this, we introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs to learn to activate fewer neurons and achieve a better trade-off between sparsity and performance. Furthermore, unlike SOTA MoEfication methods, which mainly focus on ReLU-based models, LTE can also be applied to LLMs like LLaMA using non-ReLU activations. Extensive evaluation on language understanding, language generation, and instruction tuning tasks shows that LTE consistently outperforms SOTA baselines. Along with our hardware-aware custom kernel implementation, LTE reduces LLaMA2-7B inference latency by 25% at 50% sparsity.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the high inference cost of large language models (LLMs). These models have reached billions of parameters and achieved remarkable success on natural language processing tasks, but their computational and memory requirements make deployment costly, especially in low-latency applications such as chatbots and autonomous driving. In addition, existing methods mainly exploit the activation sparsity that forms naturally in pre-trained models and ignore the potential to amplify this inherent sparsity further.

### Specific problems

1. **High inference cost**: The computational and memory requirements of LLMs at inference time are very high, which raises deployment costs and degrades the user experience.
2. **Limitations of existing methods**: Existing methods mainly exploit the natural activation sparsity of pre-trained models and overlook the potential to increase it.
3. **Challenges of non-ReLU activation functions**: Newer models use soft activation functions (such as SwiGLU and GeGLU) and therefore have lower natural activation sparsity, which makes existing methods difficult to apply directly.

### Solutions

To solve these problems, the authors propose a new training algorithm, Learn-To-be-Efficient (LTE), which aims to train more efficient LLMs and reduce inference cost by achieving more structured activation sparsity. Specifically, LTE works in the following ways:

1. **Efficiency loss penalty**: An efficiency loss term is added during training to encourage the model to activate fewer neurons in the feed-forward network (FFN) layers while maintaining good task performance.
2. **Threshold-based sigmoid routing**: Experts are selected with a threshold-based sigmoid routing strategy rather than by choosing a fixed number of experts, which makes expert selection more flexible.
3. **Two-stage training**: A two-stage training mechanism improves training stability. In the first stage, the model and the router are trained jointly; in the second stage, the model is adapted to the discrete selection mode.

### Experimental results

The experimental results show that LTE outperforms existing baseline methods on multiple tasks (natural language understanding, natural language generation, and instruction tuning). In particular, on LLaMA2-7B, LTE reduces inference latency by 25% at 50% sparsity and provides a 1.83x to 2.59x speedup in terms of FLOPs.

### Conclusion

By achieving more structured activation sparsity, LTE reduces the inference cost of LLMs while maintaining model performance. This provides a new option for the efficient deployment of large language models in practical applications.
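
The threshold-based sigmoid routing and the efficiency penalty described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy router weights, the 0.5 threshold, the coefficient `eta`, and the choice of the mean routing score as the penalty are all assumptions made for illustration. The point is only to show that a threshold lets the number of active experts vary per input, and that a penalty on routing scores pushes the model toward activating fewer of them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def router_scores(x, router_weights):
    """One sigmoid score per expert: sigmoid(w_e . x). Weights are hypothetical."""
    return [sigmoid(sum(w * xi for w, xi in zip(w_e, x))) for w_e in router_weights]

def select_experts(scores, threshold=0.5):
    """Threshold-based selection: keep every expert whose score exceeds the
    threshold, so the number of active experts is not fixed in advance."""
    return [i for i, s in enumerate(scores) if s > threshold]

def efficiency_penalty(scores):
    """Assumed penalty form (mean routing score); the paper's exact loss may differ."""
    return sum(scores) / len(scores)

def total_loss(task_loss, scores, eta=0.1):
    """Task loss plus a weighted efficiency term; eta is a hypothetical coefficient."""
    return task_loss + eta * efficiency_penalty(scores)

# Toy input and a router over four experts (groups of FFN neurons).
x = [1.0, -2.0]
router_weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]

scores = router_scores(x, router_weights)
active = select_experts(scores)
sparsity = 1.0 - len(active) / len(scores)

print("active experts:", active)
print("sparsity:", sparsity)
print("loss with penalty:", total_loss(1.0, scores))
```

With a top-k router the number of selected experts would be the same for every input; here a different input `x` can activate more or fewer experts, and only the selected experts' FFN weights would need to be computed at inference time.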

Learn To be Efficient: Build Structured Sparsity in Large Language Models

Training-Free Activation Sparsity in Large Language Models

Sparsity-Accelerated Training for Large Language Models

Activation Sparsity Opportunities for Compressing General Large Language Models

ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

Search for Efficient Large Language Models

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models

ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs

Scaling Sparse Fine-Tuning to Large Language Models

Achieving Sparse Activation in Small Language Models

CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference

SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking