Abstract: Despite their remarkable achievements, gigantic transformers suffer from significant drawbacks, including exorbitant computational and memory footprints during training, as well as severe collapse evidenced by a high degree of parameter redundancy. Sparsely-activated Mixture-of-Experts (SMoEs) have shown promise in mitigating the training-efficiency issue, yet they are prone to (1) redundant experts due to representational collapse; and (2) poor expert scalability for inference and downstream fine-tuning, primarily because the learned routing policy overfits to the number of experts activated during training. Whereas recent research efforts focus predominantly on improving routing policies to encourage expert specialization, this work explores the overlooked scalability bottleneck of SMoEs and leverages it to effectively scale dense transformers. To this end, we propose a new plug-and-play training framework, SMoE-Dropout, that enables scaling transformers to better accuracy at their full capacity without collapse. Specifically, SMoE-Dropout uses a randomly initialized and fixed router network to activate experts, and gradually increases the number of activated experts as training progresses. Transformers trained with SMoE-Dropout naturally exhibit a self-slimmable property subject to resource availability, offering smooth and consistent performance boosts as the number of activated experts increases during inference or fine-tuning. Our extensive experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout compared to dense training baselines with equivalent parameter counts. In particular, our trained BERT outperforms its densely trained counterpart with consistent improvements of {1.03%, 0.78%, 1.09%} on the challenging reasoning tasks {ASDiv-A, MAWPS, SVAMP}, respectively.
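
The two ingredients named in the abstract, a fixed random router and a growing number of activated experts, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the linear shape of the schedule, the starting value of one expert, and the use of dot-product scores with top-k selection.

```python
import random


def make_fixed_router(d_model, num_experts, seed=0):
    """Draw router weights once; they are never updated afterwards.

    Hypothetical helper: one weight row per expert.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)]
            for _ in range(num_experts)]


def activated_experts(step, total_steps, num_experts, k_start=1):
    """Number of experts to activate at a given training step.

    Assumption: a linear ramp from k_start up to num_experts. The abstract
    only says the number increases gradually during training.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return k_start + int(frac * (num_experts - k_start))


def route(token, router, k):
    """Return the indices of the k experts with the highest router scores."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router]
    ranked = sorted(range(len(router)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    router = make_fixed_router(d_model=4, num_experts=8, seed=0)
    token = [0.5, -1.0, 0.25, 2.0]
    for step in (0, 50, 100):
        k = activated_experts(step, total_steps=100, num_experts=8)
        print(step, k, route(token, router, k))
```

Because the router in this sketch never changes, the experts chosen for a token at a smaller k are always a prefix of those chosen at a larger k. That is one plausible reading of why the number of activated experts can be varied at inference time, though the abstract itself does not spell out the mechanism.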

BERT Busters: Outlier Dimensions that Disrupt Transformers

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Understanding and Minimising Outlier Features in Neural Network Training

Can persistent homology whiten Transformer-based black-box models? A case study on BERT compression

Abrupt Learning in Transformers: A Case Study on Matrix Completion

ProTransformer: Robustify Transformers via Plug-and-Play Paradigm

Understanding the Difficulty of Training Transformers

STAT: Shrinking Transformers After Training

Transformers need glasses! Information over-squashing in language tasks

Pretrained Transformers Do not Always Improve Robustness

Outlier Dimensions Encode Task-Specific Knowledge

Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality

What Matters in Transformers? Not All Attention is Needed

The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter

Transformer on a Diet

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

Not all layers are equally as important: Every Layer Counts BERT

Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance

Exploring Extreme Parameter Compression for Pre-trained Language Models