nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales

Yiqun Yao, Siqi Fan, Xiusheng Huang, Xuezhi Fang, Xiang Li, Ziyi Ni, Xin Jiang, Xuying Meng, Peng Han, Shuo Shang, Kang Liu, Aixin Sun, Yequan Wang
2023-01-01
Abstract: As language models scale up, it becomes increasingly expensive to verify research ideas because conclusions on small models do not trivially transfer to large ones. A possible solution is to establish a generic system that accurately predicts certain metrics for large models without training them. Existing scaling laws require hyperparameter search on the largest models, limiting their predictive capability. In this paper, we present an approach (namely μScaling) to predict the pre-training loss, based on our observations that Maximal Update Parametrization (μP) enables accurate fitting of scaling laws close to common loss basins in hyperparameter space. With μScaling, different model designs can be compared on large scales by training only their smaller counterparts. Further, we introduce nanoLM: an affordable LLM pre-training benchmark that facilitates this new research paradigm. With around 14[...] forecast the loss for models up to 52B. Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models. We also aspire for our benchmark to serve as a bridge between the academic community and the industry. Code for μScaling is available at https://github.com/cofe-ai/Mu-scaling. Code for nanoLM will be available later.