Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar
Abstract: Recent developments in large language models have sparked interest in efficient pretraining methods. Stagewise training approaches to improve efficiency, like gradual stacking and layer dropping (Reddi et al., 2023; Zhang & He, 2020), have recently garnered attention. The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective, especially when compared to stacking-based approaches. This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive, if not better, than stacking methods. Specifically, we develop a principled stagewise training framework, progressive subnetwork training, which only trains subnetworks within the model and progressively increases the size of subnetworks during training, until it trains the full network. We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork (e.g. depth-wise, width-wise) of the network at each step, progressively increasing the size in stages. We show that this approach not only generalizes prior works like layer dropping but also fixes their key issues. Furthermore, we establish a theoretical basis for such approaches and provide justification for (a) increasing complexity of subnetworks in stages, conceptually diverging from prior works on layer dropping, and (b) stability in loss across stage transitions in presence of key modern architecture components like residual connections and layer norms. Through comprehensive experiments, we demonstrate that RAPTR can significantly speed up training of standard benchmarks like BERT and UL2, up to 33% compared to standard training and, surprisingly, also shows better downstream performance on UL2, improving QA tasks and SuperGLUE by 1.5%; thereby, providing evidence of better inductive bias.
What problem does this paper attempt to address?
This paper aims to make pretraining of large language models such as BERT and UL2 more efficient, specifically by designing a stagewise training method that reduces compute and training time. Existing stagewise methods fall into two categories: gradual stacking and layer dropping. Gradual stacking reduces floating-point operations (FLOPs) and training time, but its performance is very sensitive to the stacking schedule, and the full model cannot be evaluated in the early stages of training. Layer dropping also saves compute, but it performs poorly when learning complex features.
To address these problems, the paper proposes a new stagewise training framework, **progressive subnetwork training**, and a concrete instantiation called **Random Part Training (RaPTr)**. At each training step, RaPTr selects and trains only a random subnetwork of the model (for example depth-wise or width-wise). The subnetwork size increases from stage to stage until the full model is trained. This strategy significantly accelerates training while maintaining, or even improving, downstream task performance.
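The depth-wise case can be illustrated with a small sketch. It is not the authors' implementation: the helper names (`subnetwork_size`, `sample_layer_subset`, `forward`), the equal-length stages, the stage sizes, and the toy scalar "layers" are all assumptions chosen for illustration. The idea it shows is that each step samples a random subset of residual blocks whose size is set by the current stage, and skipped blocks reduce to the identity through the residual connection.

```python
import random

def subnetwork_size(step, total_steps, stage_sizes):
    """Number of layers to train at `step`, assuming equal-length stages (an assumption)."""
    stage = min(step * len(stage_sizes) // total_steps, len(stage_sizes) - 1)
    return stage_sizes[stage]

def sample_layer_subset(num_layers, size, rng):
    """Pick `size` distinct layer indices uniformly at random, kept in depth order."""
    return sorted(rng.sample(range(num_layers), size))

def forward(x, layers, active):
    """Apply only the active residual blocks; a skipped block acts as the identity."""
    for i in active:
        x = x + layers[i](x)
    return x

# Toy setup: 12 scalar "layers" and a hypothetical 4-stage schedule.
rng = random.Random(0)
num_layers = 12
stage_sizes = [6, 8, 10, 12]  # hypothetical sizes, ending at the full network
layers = [lambda x, s=0.01 * (i + 1): s * x for i in range(num_layers)]
total_steps = 1000

for step in (0, 250, 500, 999):
    k = subnetwork_size(step, total_steps, stage_sizes)
    active = sample_layer_subset(num_layers, k, rng)
    out = forward(1.0, layers, active)
    print(f"step={step} size={k} active={active} out={out:.4f}")
```

In a real training loop the sampled subset would be redrawn at every step and gradients would flow only through the active blocks; the final stage uses all layers, so it coincides with standard training.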
### Main contributions
1. **Introduces the progressive subnetwork training framework**: The framework generalizes earlier layer-dropping strategies and allows more flexible selection and training of subnetworks.
2. **Proposes Random Part Training (RaPTr)**: A specific instance of progressive subnetwork training that trains randomly selected subnetworks and increases their size in stages over the course of training.
3. **Theoretical analysis and empirical support**: Using theory and experiments on polynomial data, the paper justifies increasing subnetwork complexity in stages and shows RaPTr's advantage in learning higher-order features.
4. **Extensive experimental verification**: Experiments on large language models such as BERT and UL2 show that RaPTr improves training efficiency while matching or improving model quality.
### Experimental results
- **BERT-Base**: RaPTr matches the quality of baseline training and outperforms gradual stacking at similar FLOPs, while using about 33% fewer FLOPs than standard training.
- **UL2-1.6B**: RaPTr matches baseline training in pretraining loss and shows better performance on multiple downstream tasks, improving question-answering and SuperGLUE tasks by 1.5% on average.
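To see where FLOP savings of this kind can come from, here is a rough back-of-the-envelope sketch. It assumes that per-step cost scales linearly with the number of active layers and ignores embeddings and the output head. The schedule below is hypothetical and is not the one used in the paper, so the resulting number should not be read as a reproduction of the reported savings.

```python
def relative_compute(stages, num_layers):
    """Cost of a stagewise schedule relative to always training all layers.

    stages: list of (fraction_of_total_steps, layers_trained) pairs.
    Assumes cost per step is proportional to the number of active layers.
    """
    total_fraction = sum(fraction for fraction, _ in stages)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    return sum(fraction * k / num_layers for fraction, k in stages)

# Hypothetical schedule for a 12-layer model: four equal-length stages.
schedule = [(0.25, 6), (0.25, 8), (0.25, 10), (0.25, 12)]
cost = relative_compute(schedule, num_layers=12)
print(f"relative cost: {cost:.2f}, savings: {1 - cost:.0%}")
```

Under these assumptions the schedule costs 0.75 of full training. Spending more steps in the smaller stages increases the savings, at the risk of leaving too little full-network training at the end.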
In conclusion, the paper addresses the limitations of existing stagewise training methods with the progressive subnetwork training framework and its RaPTr instantiation, which together provide a more efficient and effective pretraining strategy.