Algorithmic progress in language models

Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla
2024-03-09
Abstract: We investigate the rate at which algorithms for pre-training language models have improved since the advent of deep learning. Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months, substantially faster than hardware gains per Moore's Law. We estimate augmented scaling laws, which enable us to quantify algorithmic progress and determine the relative contributions of scaling models versus innovations in training algorithms. Despite the rapid pace of algorithmic progress and the development of new architectures such as the transformer, our analysis reveals that the increase in compute made an even larger contribution to overall performance improvements over this time period. Though limited by noisy benchmark data, our analysis quantifies the rapid progress in language modeling, shedding light on the relative contributions from compute and algorithms.
Artificial Intelligence
What problem does this paper attempt to address?
This paper examines how quickly algorithms for pre-training language models have improved since the advent of deep learning. Analyzing over 200 language model evaluations on Wikitext and Penn Treebank from 2012 to 2023, the authors find that the compute required to reach a fixed performance threshold has halved approximately every 8 months (95% confidence interval of roughly 5 to 14 months), substantially faster than the hardware gains associated with Moore's Law. The paper estimates augmented scaling laws, which it uses to quantify algorithmic progress and to separate the contribution of scaling models from that of innovations in training algorithms. Despite rapid algorithmic progress and new architectures such as the transformer, the analysis shows that growth in compute contributed even more to overall performance improvements over this period. Although limited by noisy benchmark data, the study quantifies progress in language modeling and clarifies the relative contributions of compute and algorithms.
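To make the halving-time figure concrete, the sketch below converts a halving time into the factor by which required compute shrinks over a given period. It assumes simple exponential improvement at a constant rate, which is an illustration of the headline number and not the paper's estimation procedure. The function name `compute_reduction_factor` is hypothetical, and the 24-month value used for the Moore's Law comparison is an assumed doubling time, not a figure from the paper.

```python
def compute_reduction_factor(elapsed_months: float, halving_months: float = 8.0) -> float:
    """Factor by which the compute needed for a fixed performance level shrinks.

    Assumes constant exponential progress: the requirement halves once every
    `halving_months` months, so the factor is 2 ** (elapsed / halving).
    """
    if halving_months <= 0:
        raise ValueError("halving_months must be positive")
    return 2.0 ** (elapsed_months / halving_months)


# Halving times reported in the abstract: point estimate and 95% CI bounds.
REPORTED_HALVING_MONTHS = {"point estimate": 8.0, "CI fast end": 5.0, "CI slow end": 14.0}

# ASSUMPTION: hardware doubling every 24 months, a common reading of Moore's Law.
ASSUMED_MOORE_DOUBLING_MONTHS = 24.0

for label, halving in REPORTED_HALVING_MONTHS.items():
    yearly = compute_reduction_factor(12.0, halving)
    print(f"{label} ({halving:g} months): {yearly:.2f}x less compute per year")

hardware_yearly = compute_reduction_factor(12.0, ASSUMED_MOORE_DOUBLING_MONTHS)
print(f"assumed hardware trend: {hardware_yearly:.2f}x per year")
```

With the 8-month point estimate, one year corresponds to a factor of 2^1.5, or about 2.8, and the width of the confidence interval translates into a wide range of yearly factors.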