Abstract: Pre-trained Transformers inherently possess the characteristic of sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout the majority of the pre-training process while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between Mixture-of-Experts (MoE) based sparse training and conventional dense training during the pre-training process, leveraging the efficiency of sparse training and avoiding the static activation correlation of sparse training. Compared to dense training, SSD achieves comparable performance with identical model size and reduces pre-training costs. Moreover, the models trained with SSD can be directly used as MoE models for sparse inference and achieve the same performance as dense models with up to $2\times$ faster inference speed. Code is available at https://github.com/thunlp/moefication.

What problem does this paper attempt to address?

This paper asks whether activation sparsity can be exploited during pre-training to make Transformer models more efficient. The researchers observed that, for most of pre-training, Transformer models exhibit activation sparsity: only a small number of neurons are activated for each token. Existing research mainly uses this phenomenon through post-training methods to accelerate inference, and the potential of activation sparsity in the pre-training phase has not been fully explored. To fill this gap, the authors propose Switchable Sparse-Dense Learning (SSD). SSD dynamically switches between sparse training and dense training during pre-training, aiming to use the efficiency of sparse training while avoiding the static activation correlation that sparse training brings. In this way, SSD achieves performance comparable to that of dense training at the same model scale and reduces pre-training costs. In addition, a model trained with SSD can be used directly as a Mixture-of-Experts (MoE) model for sparse inference, achieving up to a two-fold increase in inference speed while maintaining performance comparable to that of the dense model.

### Main Contributions

1. **Observation of Activation Sparsity**: Through experiments on pre-trained models of different architectures (such as GPT, BERT, and T5), the authors found that these models begin to exhibit activation sparsity in the early stages of pre-training, and that this sparsity remains stable throughout the pre-training process.
2. **Proposing the SSD Method**: SSD dynamically switches between sparse training and dense training during pre-training, taking advantage of the efficiency of sparse training while avoiding the problems that sparse training may bring.
3. **Performance Verification**: The experimental results show that, compared with traditional dense training, SSD achieves comparable performance at the same model scale and reduces pre-training costs. In addition, a model trained with SSD can be used directly for sparse inference, achieving higher inference efficiency.

### Technical Details

- **Sparse Training**: In the sparse training mode, the model is transformed into a Sparsely-activated Mixture-of-Experts (SMoE) model. Each expert is a feed-forward network, and the SMoE layer activates only some of the experts for each input, which reduces computation.
- **Dense Training**: In the dense training mode, all model parameters are computed and optimized to achieve better performance.
- **Transition Mechanism**: When the activation sparsity is high and the activation pattern is stable, the model switches from dense training to sparse training. Conversely, when it is necessary to avoid the problems brought by sparse training, the model switches from sparse training back to dense training.

### Formulas

- **Dense Computation**:
  \[ \text{FFN}(x) = W_{o}\,\sigma(W_{i}x + b_{i}) + b_{o} \]
  where $W_{i}\in\mathbb{R}^{d_{\text{ff}}\times d_{\text{model}}}$, $W_{o}\in\mathbb{R}^{d_{\text{model}}\times d_{\text{ff}}}$, $b_{i}\in\mathbb{R}^{d_{\text{ff}}}$, $b_{o}\in\mathbb{R}^{d_{\text{model}}}$, $\sigma$ is the activation function, and $d_{\text{ff}}$ and $d_{\text{model}}$ are the dimensions of the intermediate layer and of the input/output, respectively.
- **Sparse Computation**:
  \[ \text{FFN}_{\text{SMoE}}(x) = \sum_{n=1}^{N}\alpha_{n}W_{o,n}\,\sigma(W_{i,n}x) \]
  where $W_{i,n}\in\mathbb{R}^{\frac{d_{\text{ff}}}{N}\times d_{\text{model}}}$
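The relationship between the dense and sparse formulas can be checked numerically with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: $\sigma$ is taken to be ReLU, biases are dropped as in the sparse formula, the $d_{\text{ff}}$ hidden neurons are split into $N$ contiguous groups, $W_{o,n}$ is assumed to have shape $d_{\text{model}}\times\frac{d_{\text{ff}}}{N}$, and $\alpha_{n}$ is assumed to be 1 for selected experts and 0 otherwise. All function names, including the router `top_k_gate`, are hypothetical.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W x.
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def dense_ffn(W_i, W_o, x):
    # FFN(x) = W_o * relu(W_i x); biases omitted to match the sparse form.
    return matvec(W_o, relu(matvec(W_i, x)))

def split_experts(W_i, W_o, n_experts):
    # Assumption: hidden neurons are partitioned into contiguous groups.
    d_ff = len(W_i)
    assert d_ff % n_experts == 0
    size = d_ff // n_experts
    experts = []
    for n in range(n_experts):
        lo, hi = n * size, (n + 1) * size
        W_i_n = W_i[lo:hi]                   # (d_ff / N) x d_model
        W_o_n = [row[lo:hi] for row in W_o]  # d_model x (d_ff / N)
        experts.append((W_i_n, W_o_n))
    return experts

def smoe_ffn(experts, alpha, x):
    # FFN_SMoE(x) = sum_n alpha_n * W_{o,n} * relu(W_{i,n} x)
    d_model = len(experts[0][1])
    out = [0.0] * d_model
    for (W_i_n, W_o_n), a in zip(experts, alpha):
        if a == 0.0:
            continue  # unselected experts are skipped entirely
        y = matvec(W_o_n, relu(matvec(W_i_n, x)))
        out = [o + a * v for o, v in zip(out, y)]
    return out

def top_k_gate(experts, x, k):
    # Hypothetical oracle-style router: it evaluates every expert's hidden
    # layer, so it saves no compute; it only illustrates the selection step.
    scores = [sum(relu(matvec(W_i_n, x))) for W_i_n, _ in experts]
    order = sorted(range(len(experts)), key=lambda n: scores[n], reverse=True)
    chosen = set(order[:k])
    return [1.0 if n in chosen else 0.0 for n in range(len(experts))]

random.seed(0)
d_model, d_ff, N = 4, 8, 4
W_i = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
W_o = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
x = [random.uniform(-1, 1) for _ in range(d_model)]

experts = split_experts(W_i, W_o, N)
dense_out = dense_ffn(W_i, W_o, x)
all_on = smoe_ffn(experts, [1.0] * N, x)  # every expert active
sparse_out = smoe_ffn(experts, top_k_gate(experts, x, k=2), x)
```

With every $\alpha_{n}=1$ the expert sum reproduces the bias-free dense output up to floating-point rounding, because an elementwise activation acts on each hidden neuron independently. Selecting only `k` experts drops the contribution of the remaining neurons, which is where the approximation and the compute savings come from.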

Exploring the Benefit of Activation Sparsity in Pre-training

Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

A Theoretical Explanation of Activation Sparsity Through Flat Minima and Adversarial Robustness

Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

Accelerating Transformer Pre-training with 2:4 Sparsity

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Learning Neural Networks with Sparse Activations

Mixture of Hidden-Dimensions Transformer

Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Sparseness Analysis in the Pretraining of Deep Neural Networks

Dual sparse training framework: inducing activation map sparsity via Transformed $\ell_1$ regularization

Sparse Upcycling: Inference Inefficient Finetuning

The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter

SparseMAE: Sparse Training Meets Masked Autoencoders

Training-Free Activation Sparsity in Large Language Models

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism