Abstract: Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have extensively focused on loss- and/or function-preserving behavior at initialization or simply performance at the end of training. Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)". We perform extensive analysis of correlation of the final performance with performance in the initial steps of training and find early and more accurate predictions of the optimal growing strategy (i.e., with only a small "lag" after initialization). This perspective also motivates an adaptive strategy for gradual stacking.

What problem does this paper attempt to address?

The paper primarily explores how to effectively scale up models (i.e., "model growing" strategies) when training large-scale language models based on Transformers, particularly focusing on the issue of increasing the number of model layers in the depth direction. The authors challenge a commonly adopted view in previous research, which suggests that maintaining loss or functional invariance is a good strategy for initializing larger models.

The key contributions of the paper include:

1. **Questioning the effectiveness of loss preservation as a criterion for selecting the best growth strategy**: Through empirical analysis, it was found that the initial loss value is not strongly correlated with the final model performance, indicating that loss preservation may not be a good guiding principle for choosing the best growth strategy.
2. **Proposing a new perspective, "Landscape-Aware Growing" (LAG)**: The paper points out that in the early training stages after model initialization (e.g., after a few thousand steps of training), there is a strong correlation between the model's loss value and its final performance. This observation supports the hypothesis that the model's loss stabilizes after a brief adaptation period, and this stable state can well predict the final performance.
3. **Further exploring the optimal time point for early prediction**: The study found that within a few hundred training steps after model expansion, it is possible to very accurately predict which growth strategies perform best. This means that high-performing model growth strategies can be identified in a very short time.
4. **Proposing the LAG algorithm and its application**: Based on the above findings, the authors propose the LAG algorithm, a simple method to determine the optimal growth strategy. The LAG algorithm first involves short-term training of different model growth strategies and then selecting one for long-term training based on their short-term performance.
5. **Experimental validation on BERT and UL2 models**: Experiments on BERT-Base and the large-scale self-supervised language model UL2 demonstrate the effectiveness of the LAG algorithm, showing that it can find near-optimal growth strategies and outperform other traditional growth methods.

In summary, the main goal of this paper is to address how to efficiently select the best model growth strategy. By proposing the LAG method, it provides a new theoretical foundation and practical tool for model growing.
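The selection procedure described in contribution 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the strategy names, the `toy_loss` function, and every number in `TOY_CURVES` are hypothetical and were invented here only to show how a choice made at initialization can differ from a choice made after a short lag. In a real setting, the `loss_after` callable would grow the small model with the given strategy, train for the requested number of steps, and return a validation loss.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical toy curves: (loss right after growing, long-run loss, decay time in steps).
# These values are made up for illustration and are NOT results from the paper.
TOY_CURVES: Dict[str, Tuple[float, float, float]] = {
    "loss_preserving": (2.0, 1.8, 50.0),
    "stack_top": (3.5, 1.6, 50.0),
    "random_init": (6.0, 1.7, 50.0),
}


def toy_loss(strategy: str, steps: int) -> float:
    """Stand-in for: grow with `strategy`, train for `steps`, return the loss."""
    init, floor, tau = TOY_CURVES[strategy]
    return floor + (init - floor) * math.exp(-steps / tau)


def rank_by_loss(
    strategies: List[str],
    loss_after: Callable[[str, int], float],
    steps: int,
) -> List[str]:
    """Order strategies from lowest to highest loss after `steps` of training."""
    return sorted(strategies, key=lambda s: loss_after(s, steps))


def lag_select(
    strategies: List[str],
    loss_after: Callable[[str, int], float],
    lag_steps: int,
) -> str:
    """Train every candidate briefly and keep the one with the lowest loss."""
    return rank_by_loss(strategies, loss_after, lag_steps)[0]


strategies = list(TOY_CURVES)
choice_at_init = lag_select(strategies, toy_loss, lag_steps=0)
choice_with_lag = lag_select(strategies, toy_loss, lag_steps=500)
final_ranking = rank_by_loss(strategies, toy_loss, steps=10_000)

print("choice at initialization:", choice_at_init)
print("choice after a short lag:", choice_with_lag)
print("ranking at end of training:", final_ranking)
```

In this toy setup the candidate with the lowest loss at initialization is not the one that ends up best, while the ranking after the short lag already matches the final ranking. Only the selected strategy would then be trained to completion, so the extra cost is the short run per candidate.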

Landscape-Aware Growing: The Power of a Little LAG
