Abstract: We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through estimator to the training. We also introduce Block Q-Sparse for batch training and inference. The key results from this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) we present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training from scratch, continued training of off-the-shelf LLMs, and fine-tuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). In particular, the synergy of BitNet b1.58 and Q-Sparse (which can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
What problem does this paper attempt to address?
This paper addresses the high computational cost and memory footprint of large language models (LLMs) in practical applications, especially at inference time. It introduces a method named Q-Sparse, which improves LLM efficiency by making activations fully sparse. Q-Sparse does this by applying top-K sparsification to the activations and using the straight-through estimator (STE) during training. It applies to several settings: training from scratch, continued training of pre-trained models, and fine-tuning. The paper also proposes Block Q-Sparse to make Q-Sparse compatible with batch training and inference.
### Main contributions:
1. **Q-Sparse method**: Applying top-K sparsification to the activations, with the straight-through estimator used during training, yields fully sparse activations and significantly improves inference efficiency.
2. **Inference-optimal scaling law**: The paper proposes an inference-optimal scaling law for sparsely-activated LLMs, indicating that under the same inference compute budget, sparsely-activated models outperform dense models.
3. **Effectiveness in multiple application scenarios**: Q-Sparse is shown to be effective when training from scratch, when continuing to train pre-trained models, and when fine-tuning.
4. **Compatibility with full-precision and 1-bit models**: Q-Sparse applies not only to full-precision models but also to 1-bit models (such as BitNet b1.58), and it can be combined with Mixture of Experts (MoE) to further improve efficiency.
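The straight-through estimator named in the first contribution can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function names, shapes, and the hand-written backward passes are assumptions made for illustration. It contrasts the exact gradient of a masked linear layer, which is zero wherever the mask is zero, with the straight-through gradient, which treats the masking step as the identity in the backward pass.

```python
import numpy as np

def forward(x, w, mask):
    """Masked linear layer: Y = (X * M) @ W^T."""
    return (x * mask) @ w.T

def backward_masked(grad_y, w, mask):
    """Exact gradient w.r.t. X with the mask held constant.

    Entries dropped by the mask receive zero gradient.
    """
    return (grad_y @ w) * mask

def backward_ste(grad_y, w):
    """Straight-through gradient (illustrative assumption).

    The masking op is treated as identity in the backward pass,
    so every entry of X receives a gradient.
    """
    return grad_y @ w

x = np.array([[0.1, -2.0, 0.5, 3.0]])
mask = np.array([[0.0, 1.0, 0.0, 1.0]])  # keeps the two largest |x|
w = np.ones((3, 4))                       # hypothetical weight matrix
grad_y = np.ones((1, 3))                  # hypothetical upstream gradient

g_masked = backward_masked(grad_y, w, mask)
g_ste = backward_ste(grad_y, w)
print(g_masked)
print(g_ste)
```

With the exact gradient, the masked-out entries of `x` get no training signal; with the straight-through version they do.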
### Key results:
- **Performance comparison**: Q-Sparse significantly improves inference efficiency while maintaining performance comparable to that of baseline LLMs.
- **Inference-optimal scaling law**: Under the same inference compute budget, sparsely-activated models outperform dense models.
- **Effectiveness in different scenarios**: Q-Sparse performs well when training from scratch, when continuing to train pre-trained models, and when fine-tuning.
- **Compatibility with full-precision and 1-bit models**: Q-Sparse applies to both full-precision and 1-bit models; combined with BitNet b1.58 in particular, it can significantly improve efficiency.
### Formula analysis:
- **Sparsification of matrix multiplication**:
\[
Y=(X\odot M)\cdot W^{T}
\]
where \(M = \text{Top}_K(|X|)\) is a mask that selects the \(K\) entries of the input tensor \(X\) with the largest absolute values.
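A minimal NumPy sketch of this masked matrix multiplication follows. The function names and the choice to apply top-K independently to each row of \(X\) are assumptions made for illustration, not details confirmed by the summary above.

```python
import numpy as np

def topk_mask(x: np.ndarray, k: int) -> np.ndarray:
    """Return a 0/1 mask keeping the k largest-|x| entries of each row.

    Row-wise selection is an assumption of this sketch.
    """
    # argpartition puts the indices of the k largest magnitudes in the last k slots
    idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]
    mask = np.zeros_like(x)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    return mask

def sparse_linear(x: np.ndarray, w: np.ndarray, k: int) -> np.ndarray:
    """Compute Y = (X * M) @ W^T with M = Top_K(|X|)."""
    return (x * topk_mask(x, k)) @ w.T

x = np.array([[0.1, -2.0, 0.5, 3.0]])
w = np.eye(4)  # identity weights so the effect of the mask is visible
y = sparse_linear(x, w, k=2)
print(y)
```

With identity weights the output is simply the input with all but the two largest-magnitude entries set to zero.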
- **Quantized sparsification**:
\[
Y=(Q(X)\odot M)\cdot W^{T}
\]
where \(Q(X)\) is the quantization function, defined as:
\[
Q(X)=\text{RoundClip}\left(\frac{127}{\gamma+\epsilon}X, -128, 127\right)
\]
\[
\gamma=\max(|X|)
\]
\[
\text{RoundClip}(X,a,b)=\min(\max(\text{round}(X),a),b)
\]
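The quantization function \(Q(X)\) defined by these three equations can be sketched in NumPy as follows. The value of `eps` and the per-tensor scope of the maximum are assumptions of this sketch; note also that `np.round` rounds exact halves to the nearest even integer, which may differ from other rounding conventions.

```python
import numpy as np

def round_clip(x: np.ndarray, a: float, b: float) -> np.ndarray:
    """RoundClip(X, a, b) = min(max(round(X), a), b)."""
    return np.minimum(np.maximum(np.round(x), a), b)

def quantize(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Q(X): scale by 127 / (gamma + eps), then round and clip to [-128, 127].

    gamma is the largest absolute value in x; eps (value assumed here)
    guards against division by zero.
    """
    gamma = np.max(np.abs(x))
    return round_clip(127.0 / (gamma + eps) * x, -128, 127)

x = np.array([0.5, -1.0, 0.25, 0.0])
q = quantize(x)
print(q)
```

In the quantized sparse product, the mask \(M\) would be applied to `q` before multiplying by \(W^{T}\).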
- **Squared-ReLU gated linear unit (ReLU2GLU)**:
\[
\text{ReLU2GLU}(X)=XW_{\text{up}}^{T}\odot\text{ReLU}^{2}(XW_{\text{gate}}^{T})
\]
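The ReLU2GLU formula can be written directly in NumPy. The weight shapes and example values below are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
import numpy as np

def relu2glu(x: np.ndarray, w_up: np.ndarray, w_gate: np.ndarray) -> np.ndarray:
    """ReLU2GLU(X) = (X @ W_up^T) * ReLU(X @ W_gate^T) ** 2."""
    gate = np.maximum(x @ w_gate.T, 0.0) ** 2  # squared ReLU of the gate projection
    return (x @ w_up.T) * gate

# Hypothetical example: 2 input features, 3 hidden units
x = np.array([[1.0, -1.0]])
w_up = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_gate = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, -1.0]])

out = relu2glu(x, w_up, w_gate)
print(out)
```

Here the gate pre-activations are `[2, -2, 2]`, so the second hidden unit is zeroed by the ReLU. This shows how the squared ReLU produces exact zeros in the activations.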
- **Inference optimal scaling law**:
\[
L(N,S)\triangleq E+\frac{A(S)}{N^{\alpha}}
\]
\[
A(S)=B + C\exp\left(\frac{\beta}{1 - S}\right)
\]
where \(L(N,S)\) is the loss of a model with \(N\) parameters and sparsity ratio \(S\).
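To make the functional form of the scaling law concrete, the sketch below evaluates \(E + A(S)/N^{\alpha}\) in plain Python. All constant values are placeholders invented for illustration; they are not the fitted values from the paper, and no conclusion about actual model quality should be drawn from them.

```python
import math

def a_of_s(s: float, b: float, c: float, beta: float) -> float:
    """A(S) = B + C * exp(beta / (1 - S)), defined for S < 1."""
    return b + c * math.exp(beta / (1.0 - s))

def scaling_law(n: float, s: float, e: float, alpha: float,
                b: float, c: float, beta: float) -> float:
    """L(N, S) = E + A(S) / N**alpha."""
    return e + a_of_s(s, b, c, beta) / n ** alpha

# Placeholder constants (hypothetical, NOT fitted values from the paper)
params = dict(e=1.0, alpha=0.5, b=2.0, c=1.0, beta=0.1)

for n in (1e8, 1e9):
    print(n, scaling_law(n, s=0.5, **params))
```

For any positive \(\alpha\) and positive \(A(S)\), the second term shrinks as \(N\) grows, so the value approaches \(E\) from above.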