Abstract: Recent developments in large language models have sparked interest in efficient pretraining methods. Stagewise training approaches that improve efficiency, such as gradual stacking and layer dropping (Reddi et al., 2023; Zhang & He, 2020), have recently garnered attention. The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective, especially when compared to stacking-based approaches. This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive with, if not better than, stacking methods. Specifically, we develop a principled stagewise training framework, progressive subnetwork training, which trains only subnetworks within the model and progressively increases their size during training until the full network is trained. We propose an instantiation of this framework, Random Part Training (RAPTR), that selects and trains only a random subnetwork (e.g. depth-wise, width-wise) of the network at each step, progressively increasing the size in stages. We show that this approach not only generalizes prior works like layer dropping but also fixes their key issues. Furthermore, we establish a theoretical basis for such approaches and provide justification for (a) increasing the complexity of subnetworks in stages, which diverges conceptually from prior works on layer dropping, and (b) stability of the loss across stage transitions in the presence of key modern architecture components like residual connections and layer norms. Through comprehensive experiments, we demonstrate that RAPTR can speed up training of standard benchmarks like BERT and UL2 by up to 33% compared to standard training and, surprisingly, also shows better downstream performance on UL2, improving performance on QA tasks and SuperGLUE by 1.5%, thereby providing evidence of a better inductive bias.
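The depth-wise case described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the schedule values, the function names (`keep_fraction`, `sample_layers`, `forward`), and the choice to sample layers uniformly are all assumptions made for illustration. It shows only the two ideas stated above: a random subset of layers is chosen at each step, and the subset size grows stage by stage until the full network is used. Skipped layers reduce to the identity because each layer is applied as a residual update.

```python
import random

# Hypothetical stage schedule: (first step of the stage, fraction of layers used).
# These numbers are illustrative only and are not taken from the paper.
SCHEDULE = [(0, 0.5), (1000, 0.75), (2000, 1.0)]


def keep_fraction(step, schedule=SCHEDULE):
    """Return the fraction of layers to train at a given step."""
    fraction = schedule[0][1]
    for start, frac in schedule:
        if step >= start:
            fraction = frac
    return fraction


def sample_layers(num_layers, fraction, rng):
    """Pick a random depth-wise subnetwork as a sorted list of layer indices."""
    k = max(1, round(num_layers * fraction))
    return sorted(rng.sample(range(num_layers), k))


def forward(x, layers, active):
    """Apply only the active layers as residual updates.

    A skipped layer contributes nothing, so it behaves like the identity.
    """
    for i in active:
        x = x + layers[i](x)
    return x


def training_step_layers(step, num_layers, rng):
    """Choose the layers that would be run (and updated) at this step."""
    return sample_layers(num_layers, keep_fraction(step), rng)
```

In a real training loop, `training_step_layers` would be called once per step and only the selected layers would receive gradients; how the paper handles details such as which layers are always kept or how outputs are scaled is not specified in the abstract and is left out here.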

G2Basy: A Framework to Improve the RNN Language Model and Ease Overfitting Problem.

Using Context-to-Vector with Graph Retrofitting to Improve Word Embeddings

Exploration of Tree-based Hierarchical Softmax for Recurrent Language Models

Incrementally Learning the Hierarchical Softmax Function for Neural Language Models

Bi-Drop: Enhancing Fine-tuning Generalization Via Synchronous Sub-Net Estimation and Optimization

Evolving Subnetwork Training for Large Language Models

Training With Additional Semantic Constraints For Enhancing Neural Machine Translation

Efficient Stagewise Pretraining via Progressive Subnetworks

Recurrent Neural Network Language Model With Structured Word Embeddings For Speech Recognition

Optimization of Recurrent Neural Networks on Natural Language Processing.

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability.

Word-Level Permutation and Improved Lower Frame Rate for RNN-Based Acoustic Modeling.

Efficient and effective training of language and graph neural network models

Loop Neural Networks for Parameter Sharing

Dropout Token To Improve Neural Language Model

Adaptive Optimization for Enhanced Efficiency in Large-Scale Language Model Training

Advanced Recurrent Network-Based Hybrid Acoustic Models for Low Resource Speech Recognition

LanYUAN, a GPT Large Model Using Curriculum Learning and Sparse Attention.

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

Speed Up the Training of Neural Machine Translation

Accelerating the Training of Large Language Models Using Efficient Activation Rematerialization and Optimal Hybrid Parallelism.