Abstract: LLMs are computationally expensive to pre-train due to their large scale. Model growth has emerged as a promising approach that leverages smaller models to accelerate the training of larger ones. However, the viability of model growth methods for efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, substantially accelerates training, leading to lower loss and better overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these results, we conduct extensive experiments on $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments on LLMs of up to 7B parameters after growth and pre-training runs of up to 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, a 54.6% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines for determining the growth timing and growth factor for $G_{\text{stack}}$, making it practical for general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at https://llm-stacking.github.io.
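
As a rough illustration of the two quantitative ideas in the abstract, the sketch below shows one way depthwise stacking could look and how the quoted 54.6% figure follows from the token counts. It is a minimal sketch, not the authors' implementation: it assumes that depthwise stacking means repeating a small model's layer sequence a whole number of times (the growth factor), and it represents layers as plain dictionaries. The names `stack_layers` and `token_speedup` are hypothetical and do not come from the paper or its code release.

```python
import copy


def stack_layers(layers, growth_factor):
    """Return a deeper layer list made of `growth_factor` copies of `layers`.

    Assumption: depthwise stacking is modeled as repeating the whole
    layer sequence of the small model; this is an illustrative reading,
    not the paper's exact operator.
    """
    if growth_factor < 1:
        raise ValueError("growth_factor must be >= 1")
    # Deep-copy every layer so the copies can diverge during later training.
    return [copy.deepcopy(layer) for _ in range(growth_factor) for layer in layers]


def token_speedup(baseline_tokens, grown_tokens):
    """Relative speedup in tokens needed to reach the same loss."""
    return baseline_tokens / grown_tokens - 1.0


# Hypothetical 2-layer "small model"; real layers would be transformer blocks.
small = [{"name": f"block{i}", "weights": [0.1 * i, 0.2 * i]} for i in range(2)]
large = stack_layers(small, growth_factor=4)

# Token counts taken from the abstract: 300B (baseline) vs. 194B (grown model).
speedup = token_speedup(300e9, 194e9)
print(len(large), f"{speedup:.1%}")  # prints "8 54.6%"
```

The speedup here is measured as the ratio of training tokens minus one, 300/194 - 1 ≈ 0.546, which matches the figure reported in the abstract.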

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
