Abstract: Recent developments in large language models have sparked interest in efficient pretraining methods. Stagewise training approaches to improve efficiency, like gradual stacking and layer dropping (Reddi et al., 2023; Zhang & He, 2020), have recently garnered attention. The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective, especially when compared to stacking-based approaches. This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive with, if not better than, stacking methods. Specifically, we develop a principled stagewise training framework, progressive subnetwork training, which trains only subnetworks within the model and progressively increases the size of the subnetworks during training, until it trains the full network. We propose an instantiation of this framework, Random Part Training (RaPTr), that selects and trains only a random subnetwork (e.g. depth-wise, width-wise) of the network at each step, progressively increasing the size in stages. We show that this approach not only generalizes prior works like layer dropping but also fixes their key issues. Furthermore, we establish a theoretical basis for such approaches and provide justification for (a) increasing the complexity of subnetworks in stages, conceptually diverging from prior works on layer dropping, and (b) stability in loss across stage transitions in the presence of key modern architecture components like residual connections and layer norms. Through comprehensive experiments, we demonstrate that RaPTr can significantly speed up training of standard benchmarks like BERT and UL2, by up to 33% compared to standard training and, surprisingly, also shows better downstream performance on UL2, improving QA tasks and SuperGLUE by 1.5%, thereby providing evidence of better inductive bias.
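
As a rough illustration of the depth-wise case described in the abstract, the following Python sketch picks a random subset of layers at every step and grows the subset size stage by stage. Layers that are not selected are skipped, so the input passes through the residual connection unchanged. The function names, the `(num_steps, keep_count)` schedule format, and the toy layers are assumptions made for this example, not the paper's implementation.

```python
import random


def keep_count_for_step(step, schedule):
    """Return how many layers to train at `step`.

    `schedule` is a hypothetical list of (num_steps, keep_count) stages;
    keep_count is expected to grow until it equals the full depth.
    """
    boundary = 0
    for num_steps, keep_count in schedule:
        boundary += num_steps
        if step < boundary:
            return keep_count
    # Past the last stage boundary: stay on the final (full) stage.
    return schedule[-1][1]


def sample_layer_mask(num_layers, keep_count, rng):
    """Choose `keep_count` layers uniformly at random; True means the layer runs."""
    if not 0 <= keep_count <= num_layers:
        raise ValueError("keep_count must be between 0 and num_layers")
    chosen = set(rng.sample(range(num_layers), keep_count))
    return [i in chosen for i in range(num_layers)]


def forward(x, layers, mask):
    """Residual stack: active layers add their update, skipped layers are identity."""
    for layer, active in zip(layers, mask):
        if active:
            x = x + layer(x)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    num_layers = 12
    # Hypothetical three-stage schedule ending with the full network.
    schedule = [(10, 6), (10, 9), (10, 12)]
    layers = [lambda v: 0.1 * v for _ in range(num_layers)]  # toy layers
    for step in (0, 10, 20):
        k = keep_count_for_step(step, schedule)
        mask = sample_layer_mask(num_layers, k, rng)
        print(step, k, forward(1.0, layers, mask))
```

In a real training loop the mask would be resampled at every step, and only the selected layers would receive gradients for that step.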

What problem does this paper attempt to address?

The main problem this paper addresses is the pretraining efficiency of large language models (such as BERT and UL2): it designs an effective stagewise training method that reduces compute and training time. Existing stagewise training methods fall into two main categories: gradual stacking and layer dropping. Gradual stacking reduces floating-point operations (FLOPs) and training time, but its performance is very sensitive to the stacking schedule, and the full model cannot be evaluated during the early stages. Layer dropping saves compute, but it performs poorly when learning complex features. To address these problems, the paper proposes a new stagewise training framework, **progressive subnetwork training**, and instantiates it as a method called **Random Part Training (RaPTr)**. The method selects and trains random subnetworks of the model at each training stage, gradually increases the size of the subnetworks, and finally trains the entire model. This strategy can significantly accelerate training while maintaining or even improving downstream task performance.

### Main contributions

1. **Introduced the progressive subnetwork training framework**: This framework extends the earlier layer-dropping strategy and allows more flexible selection and training of subnetworks.
2. **Proposed Random Part Training (RaPTr)**: This is a specific instance of progressive subnetwork training. It achieves efficient training by randomly selecting subnetworks and gradually increasing their size during training.
3. **Theoretical analysis and empirical evidence**: Using theoretical analysis and experiments on polynomial data, the paper justifies gradually increasing the complexity of subnetworks and demonstrates the advantages of RaPTr in learning higher-order features.
4. **Extensive experimental verification**: Experiments on large language models such as BERT and UL2 show the advantages of RaPTr in training efficiency and model quality.

### Experimental results

- **BERT-Base**: RaPTr achieves performance comparable to baseline training under similar FLOPs and outperforms the gradual stacking method, while reducing FLOPs by about 33%.
- **UL2-1.6B**: RaPTr is comparable to baseline training in pretraining loss and shows better performance on multiple downstream tasks, especially question answering and SuperGLUE, with an average improvement of 1.5%.

In conclusion, the paper addresses the limitations of existing stagewise training methods by proposing the progressive subnetwork training framework and the RaPTr method, providing a more efficient and effective training strategy.
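
To make the FLOPs savings concrete, here is a small worked calculation in Python. It is only a sketch under a simplifying assumption: the cost of a training step is proportional to the number of layers that are active, ignoring embeddings and the output head. The schedule below is hypothetical and chosen for easy arithmetic; it is not the schedule used in the paper.

```python
def relative_flops(schedule, num_layers):
    """Compute of a stagewise run relative to training the full model throughout.

    `schedule` is a hypothetical list of (num_steps, active_layers) stages.
    Assumes per-step cost scales linearly with the number of active layers.
    """
    total_steps = sum(steps for steps, _ in schedule)
    used = sum(steps * active for steps, active in schedule)
    return used / (total_steps * num_layers)


# Hypothetical 12-layer model trained in four equal stages.
schedule = [(25, 6), (25, 8), (25, 10), (25, 12)]
ratio = relative_flops(schedule, num_layers=12)
print(ratio)        # 0.75, i.e. 900 layer-steps instead of 1200
print(1.0 - ratio)  # 0.25, a 25% saving under these assumptions
```

Spending more steps in the smaller stages lowers the ratio further, at the cost of less time training the full network.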

Efficient Stagewise Pretraining via Progressive Subnetworks

Efficient Training of BERT by Progressively Stacking

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Evolving Subnetwork Training for Large Language Models

A Framework for Provably Stable and Consistent Training of Deep Feedforward Networks

LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking

Revisiting Token Dropping Strategy in Efficient BERT Pretraining

Learning Hierarchical Structures with Differentiable Nondeterministic Stacks

Stimulative Training++: Go Beyond The Performance Limits of Residual Networks

Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review

Read Between the Layers: Leveraging Multi-Layer Representations for Rehearsal-Free Continual Learning with Pre-Trained Models

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models

Landscape-Aware Growing: The Power of a Little LAG

Layer-wise Learning Rate Optimization for Task-Dependent Fine-Tuning of Pre-trained Models: An Evolutionary Approach

Towards Structured Dynamic Sparse Pre-Training of BERT

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively

reStructured Pre-training

Bi-Drop: Enhancing Fine-tuning Generalization Via Synchronous Sub-Net Estimation and Optimization

RoBERTa: A Robustly Optimized BERT Pretraining Approach