Abstract: We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through estimator to the training. We also introduce Block Q-Sparse for batch training and inference. The key results from this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) we present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training from scratch, continued training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). In particular, the synergy of BitNet b1.58 and Q-Sparse (which can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
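The two ingredients named in the abstract, top-K sparsification of activations and the straight-through estimator, can be illustrated with a minimal pure-Python sketch. The function names (`topk_mask`, `sparse_forward`, `ste_backward`) are hypothetical and not taken from the paper. The backward rule shown is the usual reading of a straight-through estimator (the gradient is passed through the mask unchanged), which is an assumption about how it is applied here.

```python
def topk_mask(x, k):
    """Return a 0/1 mask that keeps the k entries of x with the largest |value|."""
    # Sort indices by absolute value, largest first. Python's sort is stable,
    # so ties are broken in favor of the earlier index.
    order = sorted(range(len(x)), key=lambda i: -abs(x[i]))
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(x))]


def sparse_forward(x, k):
    """Forward pass: zero out every activation outside the top-k by magnitude."""
    mask = topk_mask(x, k)
    return [xi * mi for xi, mi in zip(x, mask)]


def ste_backward(grad_output):
    """Backward pass under the straight-through assumption.

    The sparsification is treated as the identity, so the incoming gradient
    reaches every activation, including those zeroed in the forward pass.
    """
    return list(grad_output)


if __name__ == "__main__":
    activations = [0.5, -2.0, 0.1, 1.5]
    print(sparse_forward(activations, k=2))
    print(ste_backward([0.1, 0.2, 0.3, 0.4]))
```

Without the straight-through step, the gradient would be multiplied by the same mask and the dropped activations would receive no gradient signal at all.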

What problem does this paper attempt to address?

This paper addresses the high computational cost and memory footprint of large language models (LLMs) in practical applications, especially at inference time. It introduces Q-Sparse, a method that improves LLM efficiency by making activations fully sparse. Q-Sparse applies top-K sparsification to the activations and uses the straight-through estimator (STE) during training, and it applies to several scenarios: training from scratch, continued training of pre-trained models, and fine-tuning. The paper also proposes Block Q-Sparse to make Q-Sparse compatible with batch training and inference.

### Main contributions:

1. **Q-Sparse method**: Applying top-K sparsification to the activations, together with the straight-through estimator in training, yields fully sparse activations and thereby significantly improves inference efficiency.
2. **Inference-optimal scaling law**: The paper proposes an inference-optimal scaling law for sparsely-activated LLMs, indicating that under the same inference compute budget, sparsely-activated models perform better than dense models.
3. **Effectiveness in multiple application scenarios**: Q-Sparse is shown to be effective when training from scratch, when continuing to train pre-trained models, and when fine-tuning.
4. **Compatibility with full-precision and 1-bit models**: Q-Sparse applies not only to full-precision models but also to 1-bit models (such as BitNet b1.58), and can be combined with the Mixture of Experts (MoE) mechanism to further improve efficiency.

### Key results:

- **Performance comparison**: Q-Sparse significantly improves inference efficiency while maintaining performance comparable to that of baseline LLMs.
- **Inference-optimal scaling law**: Under the same inference compute budget, sparsely-activated models outperform dense models.
- **Effectiveness in different scenarios**: Q-Sparse performs well when training from scratch, when continuing to train pre-trained models, and when fine-tuning.
- **Compatibility with full-precision and 1-bit models**: Q-Sparse applies to both full-precision and 1-bit models; combined with BitNet b1.58 in particular, efficiency can be significantly improved.

### Formula analysis:

- **Sparsification of matrix multiplication**:
  \[ Y=(X\odot M)\cdot W^{T} \]
  where \(M = \text{Topk}(|X|)\) is the mask that selects the \(K\) elements of the input tensor \(X\) with the largest absolute values.
- **Quantized sparsification**:
  \[ Y=(Q(X)\odot M)\cdot W^{T} \]
  where \(Q(X)\) is the quantization function, defined as:
  \[ Q(X)=\text{RoundClip}\left(\frac{127}{\gamma+\epsilon}X, -128, 127\right) \]
  \[ \gamma=\max(|X|) \]
  \[ \text{RoundClip}(X,a,b)=\min(\max(\text{round}(X),a),b) \]
- **Squared ReLU function**:
  \[ \text{ReLU2GLU}(X)=XW_{\text{up}}^{T}\odot\text{ReLU}^{2}(XW_{\text{gate}}^{T}) \]
- **Inference-optimal scaling law**:
  \[ L(N,S)\triangleq E+\frac{A(S)}{N^{\alpha}} \]
  \[ A(S)=B + C\exp\left(\frac{\beta}{1 - S}\right) \]
  where \(L(N,S)\) represents the number of parameters \(N\)
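
The quantized sparsification and ReLU2GLU formulas listed under the formula analysis can be traced with a small pure-Python sketch on a single activation vector. All function names are hypothetical, `eps=1e-5` is an assumed value for \(\epsilon\), and any rescaling of the quantized output back to full precision is left out because the formulas above do not show it. Python's built-in `round` rounds halves to even, which may differ from the rounding mode of a real kernel.

```python
def round_clip(v, lo, hi):
    """RoundClip(v, a, b) = min(max(round(v), a), b)."""
    return min(max(round(v), lo), hi)


def quantize(x, eps=1e-5):
    """Q(X): scale by 127 / (gamma + eps), gamma = max|X|, then round and clip."""
    gamma = max(abs(v) for v in x)
    scale = 127.0 / (gamma + eps)
    return [round_clip(scale * v, -128, 127) for v in x]


def topk_mask(x, k):
    """M = Topk(|X|): integer 0/1 mask over the k largest-magnitude entries."""
    order = sorted(range(len(x)), key=lambda i: -abs(x[i]))
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(x))]


def matvec(x, w):
    """Compute x . W^T, where each row of w holds one output feature's weights."""
    return [sum(a * b for a, b in zip(x, row)) for row in w]


def quantized_sparse_linear(x, w, k):
    """Y = (Q(X) ⊙ M) . W^T for one input vector x."""
    q = quantize(x)
    m = topk_mask(x, k)
    return matvec([qi * mi for qi, mi in zip(q, m)], w)


def relu2glu(x, w_up, w_gate):
    """ReLU2GLU(X) = (X W_up^T) ⊙ ReLU^2(X W_gate^T)."""
    up = matvec(x, w_up)
    gate = matvec(x, w_gate)
    return [u * max(g, 0.0) ** 2 for u, g in zip(up, gate)]


if __name__ == "__main__":
    x = [0.5, -2.0, 0.1, 1.5]
    w = [[1, 0, 0, 1], [0, 1, 1, 0]]
    print(quantize(x))
    print(quantized_sparse_linear(x, w, k=2))
```

Here the mask is computed from the unquantized input, following \(M = \text{Topk}(|X|)\) as written above.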

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
