$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources

Apoorv Khandelwal, Tian Yun, Nihal V. Nayak, Jack Merullo, Stephen H. Bach, Chen Sun, Ellie Pavlick
2024-10-31
Abstract: Pre-training is notoriously compute-intensive and academic researchers are notoriously under-resourced. It is, therefore, commonly assumed that academics can't pre-train models. In this paper, we seek to clarify this assumption. We first survey academic researchers to learn about their available compute and then empirically measure the time to replicate models on such resources. We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed. We run our benchmark on a range of models and academic GPUs, spending 2,000 GPU-hours on our experiments. Our results reveal a brighter picture for academic pre-training: for example, although Pythia-1B was originally trained on 64 GPUs for 3 days, we find it is also possible to replicate this model (with the same hyper-parameters) in 3x fewer GPU-days: i.e. on 4 GPUs in 18 days. We conclude with a cost-benefit analysis to help clarify the trade-offs between price and pre-training time. We believe our benchmark will help academic researchers conduct experiments that require training larger models on more data. We fully release our codebase at: https://github.com/apoorvkh/academic-pretraining.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is the resource limitation that academic researchers face when pre-training large-scale language and vision models. Specifically:

1. **Resource Limitation**: Pre-training usually requires a large amount of compute, and academic researchers often lack the resources to meet these requirements. For example, Pythia-1B was originally trained on 64 GPUs for 3 days, while RoBERTa required 1000 GPUs for 1 day. Costs on this scale put pre-training experiments out of reach for many academic labs.
2. **Lack of Transparency**: The academic community has little visibility into the resources and time that pre-training requires. Researchers do not know how long pre-training will take on a given GPU configuration, which models they can train, and which they cannot. This opacity makes it hard for students and advisors to propose realistic experimental plans and budget requests.
3. **Optimization Requirements**: The paper aims to help academic researchers pre-train more efficiently under limited resources through systematic benchmarking and optimization. The authors show experimentally that, with appropriate optimization techniques, pre-training can be completed with less compute on existing hardware. For example, Pythia-1B can be trained on 4 A100 GPUs in 18 days (72 GPU-days) rather than the original 192 GPU-days.
4. **Cost-Benefit Analysis**: The paper also conducts a cost-benefit analysis to help researchers select the most appropriate hardware configuration. For example, purchasing 4 H100 GPUs (about $60,000) is more economical than purchasing 8 A100 GPUs (about $160,000), because both can train Pythia-1B in the same amount of time.
Overall, the goal of this paper is to give academic researchers a way to better understand and use their existing computing resources, so that they can make more progress on pre-training large-scale models.
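
The GPU-day and price comparisons quoted in the summary can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `TrainingRun` class and the helper function are hypothetical names introduced here, not part of the authors' codebase, and the hardware prices are simply the approximate figures quoted in the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    """Hypothetical record of a pre-training run: GPU count and wall-clock days."""
    name: str
    num_gpus: int
    days: float

    @property
    def gpu_days(self) -> float:
        # GPU-days = number of GPUs x wall-clock days
        return self.num_gpus * self.days


def gpu_day_reduction(baseline: TrainingRun, replication: TrainingRun) -> float:
    """How many times fewer GPU-days the replication uses than the baseline."""
    return baseline.gpu_days / replication.gpu_days


# Figures taken from the abstract and summary
original = TrainingRun("Pythia-1B (original)", num_gpus=64, days=3)
replicated = TrainingRun("Pythia-1B (replication)", num_gpus=4, days=18)
reduction = gpu_day_reduction(original, replicated)

# Approximate purchase prices in USD, as quoted in the summary (not verified here)
price_4x_h100 = 60_000
price_8x_a100 = 160_000
savings = price_8x_a100 - price_4x_h100

print(f"{original.name}: {original.gpu_days:.0f} GPU-days")      # 192 GPU-days
print(f"{replicated.name}: {replicated.gpu_days:.0f} GPU-days")  # 72 GPU-days
print(f"Reduction: {reduction:.2f}x")                            # 2.67x
print(f"Price difference at equal training time: ${savings:,}")  # $100,000
```

The reduction works out to about 2.67x, which the abstract rounds to "3x fewer GPU-days". Note that the replication trades wall-clock time for hardware: it uses fewer GPU-days in total but takes 18 days instead of 3.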