Landscape-Aware Growing: The Power of a Little LAG

Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar
2024-06-05
Abstract: Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have extensively focused on loss- and/or function-preserving behavior at initialization or simply performance at the end of training. Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)". We perform extensive analysis of correlation of the final performance with performance in the initial steps of training and find early and more accurate predictions of the optimal growing strategy (i.e., with only a small "lag" after initialization). This perspective also motivates an adaptive strategy for gradual stacking.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper studies how to grow Transformer-based language models during pretraining, that is, how to initialize a larger model from a smaller one, with a particular focus on growing in depth (adding layers). The authors challenge a view commonly adopted in previous research: that preserving the loss or the function at initialization is a good criterion for initializing the larger model. The key contributions of the paper are:

1. **Questioning the effectiveness of loss preservation as a criterion for selecting the best growth strategy**: Empirical analysis shows that the loss at initialization is not strongly correlated with final model performance, so loss preservation may not be a good guiding principle for choosing a growth strategy.
2. **Proposing a new perspective, "Landscape-Aware Growing" (LAG)**: In the early stage of training after the grown model is initialized (e.g., after a few thousand steps), the model's loss correlates strongly with its final performance. This supports the hypothesis that the loss settles after a brief adaptation period, and that the settled value is a good predictor of final performance.
3. **Further exploring the optimal time point for early prediction**: The study finds that within a few hundred training steps after growing, it is already possible to predict accurately which growth strategies will perform best. High-performing strategies can therefore be identified at very little cost.
4. **Proposing the LAG algorithm and its application**: Based on these findings, the authors propose LAG, a simple method for choosing a growth strategy. It first trains each candidate growth strategy for a short time, then selects one for full training based on short-term performance.
5. **Experimental validation on BERT and UL2 models**: Experiments on BERT-Base and the large-scale self-supervised language model UL2 show that LAG finds near-optimal growth strategies and outperforms other traditional growth methods.

In summary, the main goal of this paper is to address how to efficiently select the best model growth strategy. The LAG method offers a new perspective and a practical tool for model growing.
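
The selection procedure described in contribution 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `train_steps` interface, the strategy names, and the synthetic loss curves in `TOY_CURVES` are all assumptions made up for this example. The toy curves are chosen only so that the strategy with the lowest loss at initialization is not the one with the lowest final loss, mirroring the summary's point that behavior at initialization can be misleading.

```python
import math
from typing import Callable, Dict, List

# Hypothetical interface: train_steps(strategy, num_steps) returns the loss
# of a model grown with `strategy` and then trained for `num_steps` steps.
TrainFn = Callable[[str, int], float]


def lag_select(strategies: List[str], train_steps: TrainFn, lag_steps: int) -> str:
    """Pick the strategy with the lowest loss after a short 'lag' of training."""
    if not strategies:
        raise ValueError("need at least one candidate growing strategy")
    early_loss: Dict[str, float] = {s: train_steps(s, lag_steps) for s in strategies}
    return min(early_loss, key=early_loss.get)


# Synthetic stand-in for real training (invented numbers, illustration only).
# Each curve is loss(t) = floor + gap * exp(-t / tau).
TOY_CURVES = {
    "loss_preserving": (2.0, 0.5, 5000.0),  # lowest loss at init, slow progress
    "stack_copy": (1.8, 1.7, 200.0),        # high loss at init, best final loss
    "random_init": (1.9, 2.1, 300.0),       # highest loss at init
}


def toy_train_steps(strategy: str, num_steps: int) -> float:
    floor, gap, tau = TOY_CURVES[strategy]
    return floor + gap * math.exp(-num_steps / tau)


candidates = list(TOY_CURVES)
choice_at_init = lag_select(candidates, toy_train_steps, lag_steps=0)
choice_after_lag = lag_select(candidates, toy_train_steps, lag_steps=500)
choice_at_end = lag_select(candidates, toy_train_steps, lag_steps=100_000)
print(choice_at_init, choice_after_lag, choice_at_end)
```

On these toy curves, selecting at step 0 picks the loss-preserving strategy, while selecting after 500 steps already agrees with the choice made at the end of training. In a real setting, `train_steps` would grow the small model with the given strategy and run actual training, and the extra cost of LAG is roughly `lag_steps` times the number of candidates.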