Predicting Emergent Abilities with Infinite Resolution Evaluation

Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, Maosong Sun

2024-04-17

Abstract: The scientific scale-up of large language models (LLMs) necessitates a comprehensive understanding of their scaling properties. However, the existing literature on scaling properties yields only an incomplete answer: optimization loss decreases predictably as model size increases, in line with established scaling laws, yet no scaling law for task performance has been established, and task performance is far from predictable during scaling. Task performance typically shows minor gains on small models and then improves dramatically once models exceed a size threshold, exemplifying "emergent abilities". In this study, we discover that small models, although they exhibit minor performance, demonstrate critical and consistent task performance improvements that are not captured by conventional evaluation strategies due to insufficient measurement resolution. To measure such improvements, we introduce PassUntil, an evaluation strategy with theoretically infinite resolution, achieved through massive sampling in the decoding phase. With PassUntil, we conduct a quantitative investigation into the scaling law of task performance. The investigation contains two parts. Firstly, we identify a strict task scaling law that is not conventionally known to exist, enhancing the predictability of task performance. Remarkably, we are able to predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts, which is the first systematic attempt to verify the predictable scaling proposed in GPT-4's report. Secondly, we are able to study emergent abilities quantitatively. We identify a kind of accelerated emergence whose scaling curve cannot be fitted by the standard scaling law function and has an increasing speed. We then examine two hypotheses and suggest that the "multiple circuits hypothesis" might be responsible for the accelerated emergence.

Computation and Language

What problem does this paper attempt to address?

This paper studies how scaling up large language models (LLMs) affects their task performance. Optimization loss is known to decrease predictably with model size, but no comparable scaling law has been established for task performance, and the appearance of "emergent abilities" in particular remains poorly understood. Task performance typically shows only minor gains on small models and then improves sharply once model size passes a threshold. The authors argue that small models do improve consistently, but that conventional evaluation lacks the measurement resolution to detect it. To address this, they propose an evaluation strategy called PassUntil, which raises the evaluation resolution through massive sampling in the decoding phase, and use it to study the scaling law of task performance quantitatively. The results indicate that task performance can be predicted, and they reveal an accelerated form of emergence whose scaling curve cannot be fitted by the standard scaling law function. The authors also examine two hypotheses and suggest that the "multiple circuits hypothesis" might be responsible for the accelerated emergence. This work provides the first open attempt at predicting task performance.
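The resolution argument can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `successes` parameter, and the stand-in "model" (a random pass/fail draw with a fixed pass probability) are assumptions made for illustration. A fixed budget of n samples cannot report a nonzero pass rate below 1/n, whereas sampling until a pass occurs lets the estimate become as small as the sampling budget allows.

```python
import random


def fixed_budget_pass_rate(sample_passes, n):
    """Conventional evaluation: draw n samples and report the fraction that pass.

    The result is always a multiple of 1/n, so any true pass rate well below
    1/n will usually be reported as exactly 0.
    """
    hits = sum(1 for _ in range(n) if sample_passes())
    return hits / n


def estimate_pass_rate(sample_passes, successes=1, max_samples=1_000_000):
    """Sample-until-pass evaluation (hypothetical sketch).

    Keep drawing until `successes` passing samples are seen (or the budget
    runs out), then estimate the pass rate as hits / draws.
    Returns (estimate, number_of_draws).
    """
    draws = 0
    hits = 0
    while hits < successes and draws < max_samples:
        draws += 1
        if sample_passes():
            hits += 1
    estimate = hits / draws if draws else 0.0
    return estimate, draws


# Stand-in for "decode one sample and check it against the task's tests".
# The true pass probability of 0.001 is an assumed value for the demo.
rng = random.Random(0)


def toy_model_sample():
    return rng.random() < 0.001


coarse = fixed_budget_pass_rate(toy_model_sample, n=100)
fine, draws_used = estimate_pass_rate(toy_model_sample, successes=50)
print(f"fixed budget (n=100): {coarse}")
print(f"sample until 50 passes: {fine:.5f} using {draws_used} draws")
```

Requiring more than one success before stopping is a design choice in this sketch: it trades extra sampling for a lower-variance estimate.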

Unlock Predictable Scaling from Emergent Abilities

Scaling Graph Neural Networks for Large-Scale Power Systems Analysis: Empirical Laws for Emergent Abilities

Predicting Emergent Capabilities by Finetuning

U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models

Observational Scaling Laws and the Predictability of Language Model Performance

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

Are Emergent Abilities of Large Language Models a Mirage?

Predictable Emergent Abilities of LLMs: Proxy Tasks Are All You Need

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Language models scale reliably with over-training and on downstream tasks

Understanding Emergent Abilities of Language Models from the Loss Perspective

Scaling Laws for Predicting Downstream Performance in LLMs

Emergent Abilities in Reduced-Scale Generative Language Models

Has LLM Reached the Scaling Ceiling Yet? Unified Insights into LLM Regularities and Constraints

Collaborative Performance Prediction for Large Language Models

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

Are Emergent Abilities in Large Language Models just In-Context Learning?

LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law

Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve