Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization

Mohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid, Fartash Faghri, Minsik Cho, Moin Nabi, Devang Naik, Mehrdad Farajtabar
2024-09-21
Abstract: The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of training time and final accuracy? In this paper, we introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts. We demonstrate that training such an initialized model results in significant savings in terms of GPU hours required for pre-training large language models.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the high cost and long duration of pre-training Large Language Models (LLMs). Specifically:

1. **Pre-training Cost Issue**: Current large language models typically start training from randomly initialized parameters, which is both slow and costly. For example, training a model with 12 billion parameters requires approximately 72,000 GPU hours, which is expensive at public cloud prices.
2. **Trade-off Between Small and Large Models**: Small language models are cheaper to train and have less environmental impact, but their accuracy often cannot reach the level of large models. To pursue high performance, the industry therefore tends to use larger models, whose training costs are very high.

To solve these problems, the paper proposes a method called **HyperCloning**, which expands a smaller pre-trained model into a larger model with increased hidden dimensions while retaining the functionality of the small model. With this method, the large model inherits the predictive power and accuracy of the small model before training starts, and then improves further through training, which reduces the time and cost of pre-training.

### Main Contributions

1. **Function-Preserving Initialization**: Ensures that the large model reproduces the functionality of the small model immediately after initialization.
2. **Accelerated Training**: Experiments show that models initialized with HyperCloning converge faster and achieve higher final accuracy than models trained from random initialization under the same training budget.
3. **Specific Implementation of Cloning Linear Layers**: Provides a detailed explanation of how to clone linear layers in different scenarios to achieve function preservation.
4. **Model Performance Analysis**: Validates the effectiveness and advantages of HyperCloning by comparing the effects of different expansion strategies.
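
To make the idea of a function-preserving expansion of a linear layer concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not necessarily the paper's exact scheme: it assumes the hidden dimension is expanded by an integer factor, that the large layer receives the small layer's input replicated `factor` times, and that every block of the large weight matrix is a copy of the small weight divided by `factor`. The names `linear` and `clone_linear` are hypothetical helpers introduced only for this example.

```python
import random


def linear(weight, bias, x):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def clone_linear(weight, bias, factor=2):
    """Expand an (out x in) layer to (factor*out x factor*in).

    Assumption for this sketch: each block of the large weight is W / factor,
    so summing over the `factor` replicated input copies recovers W x.
    """
    # Tile each scaled row `factor` times along the input dimension.
    scaled_rows = [[w / factor for w in row] * factor for row in weight]
    # Stack the tiled rows `factor` times along the output dimension.
    big_weight = [list(row) for _ in range(factor) for row in scaled_rows]
    # The bias is replicated without scaling.
    big_bias = list(bias) * factor
    return big_weight, big_bias


random.seed(0)
out_dim, in_dim = 3, 4
W = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
b = [random.uniform(-1, 1) for _ in range(out_dim)]
x = [random.uniform(-1, 1) for _ in range(in_dim)]

small_out = linear(W, b, x)
big_W, big_b = clone_linear(W, b, factor=2)
big_out = linear(big_W, big_b, x * 2)  # the large layer sees the input replicated twice

# The large layer's output should equal the small output replicated twice,
# up to floating-point rounding.
max_err = max(abs(u - v) for u, v in zip(big_out, small_out * 2))
print("max abs difference:", max_err)
```

Because the expanded layer emits a replicated copy of the small layer's output, stacking layers cloned this way keeps the replication pattern intact from one layer to the next, which is what lets the larger network start from the smaller network's predictions in this sketch.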