Abstract: The scientific scale-up of large language models (LLMs) necessitates a comprehensive understanding of their scaling properties. However, the existing literature on scaling properties yields only an incomplete answer: optimization loss decreases predictably as model size increases, in line with established scaling laws, yet no scaling law for task performance has been established, and task performance is far from predictable during scaling. Task performance typically shows minor gains on small models and then improves dramatically once models exceed a size threshold, exemplifying the ``emergent abilities''. In this study, we discover that small models, although their measured performance is low, demonstrate critical and consistent task performance improvements that are not captured by conventional evaluation strategies due to insufficient measurement resolution. To measure such improvements, we introduce \textsc{PassUntil}, an evaluation strategy with theoretically infinite resolution, achieved through massive sampling in the decoding phase. With \textsc{PassUntil}, we conduct a quantitative investigation into the scaling law of task performance. The investigation contains two parts. Firstly, a strict \textsl{task scaling law}, which is not conventionally known to exist, is identified, enhancing the predictability of task performance. Remarkably, we are able to predict the performance of the 2.4B model on code generation with merely 0.05\% deviation before training starts, which is the first systematic attempt to verify the predictable scaling proposed in GPT-4's report. Secondly, underpinned by \textsc{PassUntil}, we observe concrete evidence of emergent abilities and ascertain that they are not in conflict with the continuity of performance improvement. They resemble a breakthrough because their scaling curve cannot be fitted by a standard scaling law function. We then introduce a mathematical definition of emergent abilities. Through this definition, we refute a prevalent ``multi-step reasoning hypothesis'' regarding the genesis of emergent abilities and propose a new hypothesis with a satisfactory fit to the observed scaling curve.
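The abstract does not spell out the \textsc{PassUntil} procedure beyond "massive sampling in the decoding phase", so the following is only a minimal sketch of one plausible reading: keep sampling until a target number of samples pass, then estimate the pass rate as successes divided by samples drawn. The function name `pass_until_estimate`, its parameters, and the simulated model are all hypothetical and are not taken from the paper.

```python
import random


def pass_until_estimate(sample_passes, successes_needed=1, max_samples=100_000):
    """Estimate a pass rate by sampling until enough samples pass.

    `sample_passes` is a hypothetical zero-argument callable that draws one
    model output and returns True if it passes the task's check.
    This is an illustrative assumption, not the paper's implementation.
    """
    successes = 0
    for drawn in range(1, max_samples + 1):
        if sample_passes():
            successes += 1
            if successes == successes_needed:
                # Simple ratio estimate: successes over samples drawn so far.
                return successes / drawn
    # Budget exhausted: fall back to the ratio over the full budget.
    return successes / max_samples


if __name__ == "__main__":
    # Simulated "small model" with a very low true pass rate (assumed value).
    rng = random.Random(0)
    true_rate = 0.002

    def simulated_sample():
        return rng.random() < true_rate

    print(pass_until_estimate(simulated_sample, successes_needed=5))
```

The design point this illustrates is resolution: a fixed budget of N samples cannot distinguish pass rates below roughly 1/N, whereas sampling until success lets the number of draws grow as the pass rate shrinks. The cap `max_samples` is a practical safeguard added here and bounds the smallest nonzero rate the sketch can report.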

Inverse scaling can become U-shaped

Inverse Scaling: When Bigger Isn't Better

U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models

Observational Scaling Laws and the Predictability of Language Model Performance

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

Unlock Predictable Scaling from Emergent Abilities

Scaling Laws Do Not Scale

Language models scale reliably with over-training and on downstream tasks

Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models

Is the Number of Trainable Parameters All That Actually Matters?

Scaling Laws for Precision

More Compute Is What You Need

Scaling Laws for Neural Language Models

Larger and more instructable language models become less reliable

Revisiting Neural Scaling Laws in Language and Vision

A Dynamical Model of Neural Scaling Laws

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Scaling laws for language encoding models in fMRI

A Solvable Model of Neural Scaling Laws

A Hitchhiker's Guide to Scaling Law Estimation

Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale