Abstract: Pre-training is notoriously compute-intensive and academic researchers are notoriously under-resourced. It is, therefore, commonly assumed that academics can't pre-train models. In this paper, we seek to clarify this assumption. We first survey academic researchers to learn about their available compute and then empirically measure the time to replicate models on such resources. We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed. We run our benchmark on a range of models and academic GPUs, spending 2,000 GPU-hours on our experiments. Our results reveal a brighter picture for academic pre-training: for example, although Pythia-1B was originally trained on 64 GPUs for 3 days, we find it is also possible to replicate this model (with the same hyper-parameters) in 3x fewer GPU-days: i.e. on 4 GPUs in 18 days. We conclude with a cost-benefit analysis to help clarify the trade-offs between price and pre-training time. We believe our benchmark will help academic researchers conduct experiments that require training larger models on more data. We fully release our codebase at: https://github.com/apoorvkh/academic-pretraining.

What problem does this paper attempt to address?

The main problem this paper addresses is the resource limitation that academic researchers face when pre-training large-scale language and vision models. Specifically:

1. **Resource Limitation**: Pre-training usually requires a large amount of compute, and academic researchers often lack the resources to meet these requirements. For example, Pythia-1B was originally trained on 64 GPUs for 3 days, while RoBERTa required 1000 GPUs for 1 day. Costs of this scale put pre-training experiments out of reach for many academic labs.
2. **Lack of Transparency**: The academic community has little visibility into the resources and time that pre-training requires. Researchers do not know how long pre-training will take on a given GPU configuration, which models are feasible to train, and which are not. This opacity makes it harder for students and supervisors to propose realistic experimental plans and budget requests.
3. **Optimization Requirements**: The paper aims to help academic researchers pre-train more efficiently under limited resources through systematic benchmarking and optimization. The authors show experimentally that, with appropriate optimization techniques, pre-training can be completed with less compute on existing hardware. For example, Pythia-1B can be trained on 4 A100 GPUs in 18 days (72 GPU-days) instead of the original 192 GPU-days.
4. **Cost-Benefit Analysis**: The paper also conducts a cost-benefit analysis to help researchers choose the most appropriate hardware configuration. For example, purchasing 4 H100 GPUs (about $60,000) is more economical than purchasing 8 A100 GPUs (about $160,000), because both configurations can train Pythia-1B in the same amount of time.

Overall, the goal of this paper is to give academic researchers a way to better understand and use their existing computing resources, and thereby make more progress in pre-training large-scale models.
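
The GPU-day figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The helper name `gpu_days` is purely illustrative and is not taken from the authors' codebase; the inputs are the numbers stated in the abstract (64 GPUs for 3 days versus 4 GPUs for 18 days).

```python
def gpu_days(num_gpus: int, days: float) -> float:
    """Compute budget as number of GPUs multiplied by wall-clock days."""
    return num_gpus * days


# Figures quoted in the abstract for Pythia-1B.
original = gpu_days(64, 3)   # original training run
academic = gpu_days(4, 18)   # replication on 4 GPUs

# Ratio of the two budgets (illustrative; the abstract rounds this to 3x).
reduction = original / academic

print(f"original run: {original:.0f} GPU-days")
print(f"4-GPU run:    {academic:.0f} GPU-days")
print(f"reduction:    {reduction:.2f}x")
```

Under these inputs the original run costs 192 GPU-days and the 4-GPU run costs 72 GPU-days, a ratio of about 2.7, which the abstract reports as roughly 3x fewer GPU-days. The trade-off is wall-clock time: the smaller configuration takes 18 days instead of 3.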

$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources

Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines

Benchmarking Resource Usage for Efficient Distributed Deep Learning

Training a Large Video Model on a Single Machine in a Day

Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising

Cramming Protein Language Model Training in 24 GPU Hours

Time Matters: Scaling Laws for Any Budget

Pipelined Backpropagation at Scale: Training Large Models without Batches

Evaluation of pre-training large language models on leadership-class supercomputers

The Power of Training: How Different Neural Network Setups Influence the Energy Demand

Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints

Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget

Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services

Efficient Large-Scale Language Model Training on GPU Clusters

M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

Parallel Training of Pre-Trained Models Via Chunk-Based Dynamic Memory Management

Decentralized Training of Foundation Models in Heterogeneous Environments

Does your data spark joy? Performance gains from domain upsampling at the end of training