Scaling Laws for Downstream Task Performance of Large Language Models

Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo

2024-02-07

Abstract: Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by two metrics: downstream cross-entropy and BLEU score. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream BLEU score with good accuracy using a log-law. However, there are also cases where moderate misalignment causes the BLEU score to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these observations, we provide new practical insights for choosing appropriate pretraining data.

Machine Learning, Computation and Language

What problem does this paper attempt to address?

This paper studies scaling laws for the downstream task performance of large language models, focusing on how the choice and size of the pretraining data affect machine translation quality after finetuning. It finds that the size of the finetuning dataset and the alignment between the pretraining and downstream data distributions significantly affect the scaling behavior. With sufficient alignment, both metrics improve monotonically with more pretraining data: the BLEU score rises and the downstream cross-entropy falls. With moderate misalignment, the BLEU score can fluctuate or get worse with more pretraining, while the downstream cross-entropy still decreases monotonically. The paper proposes a log-law for predicting the BLEU score in the well-aligned case and warns that using cross-entropy as a surrogate for task-related metrics such as BLEU may be misleading. It also provides practical guidelines for evaluating the value of pretraining data.
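To make the idea of predicting BLEU with a log-law concrete, the following is a minimal sketch, not the paper's method. It assumes a simple two-parameter form, BLEU ≈ a + b · ln(D), where D is the pretraining dataset size; the exact functional form used in the paper is not given in this summary. The function names (`fit_log_law`, `predict_bleu`) and all numbers are hypothetical, and the data points are synthetic values generated from the assumed form rather than results from the paper.

```python
import numpy as np


def fit_log_law(pretrain_sizes, bleu_scores):
    """Least-squares fit of the ASSUMED form bleu ~ a + b * ln(size)."""
    x = np.log(np.asarray(pretrain_sizes, dtype=float))
    y = np.asarray(bleu_scores, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest.
    b, a = np.polyfit(x, y, deg=1)
    return a, b


def predict_bleu(a, b, pretrain_size):
    """Evaluate the assumed log-law at a given pretraining size."""
    return a + b * np.log(pretrain_size)


# Synthetic illustration only: points generated from known coefficients
# (a=2.0, b=1.5), so the fit should recover them. Not data from the paper.
true_a, true_b = 2.0, 1.5
sizes = [1e8, 1e9, 1e10]
scores = [predict_bleu(true_a, true_b, s) for s in sizes]

a, b = fit_log_law(sizes, scores)
# Extrapolate to a larger (hypothetical) pretraining size.
extrapolated = predict_bleu(a, b, 1e11)
```

In practice one would fit on BLEU scores measured at small pretraining sizes and compare the extrapolation against a held-out larger run. Per the abstract, this kind of prediction is only expected to work when the pretraining and downstream data are sufficiently aligned; when BLEU fluctuates or degrades with more pretraining, a monotone fit like this one would not apply.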

Scaling Laws for Predicting Downstream Performance in LLMs

Scaling Laws for Multilingual Language Models

Scaling Laws for Multilingual Neural Machine Translation

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Temporal Scaling Law for Large Language Models

Scaling Law for Language Models Training Considering Batch Size

Inverse Scaling: When Bigger Isn't Better

Scaling Laws for Neural Language Models

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

A Hitchhiker's Guide to Scaling Law Estimation

The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis

Scaling Laws for Transfer

Scaling laws for post-training quantized large language models

Language models scale reliably with over-training and on downstream tasks

Collaborative Performance Prediction for Large Language Models

Selecting Large Language Model to Fine-tune via Rectified Scaling Law

Scaling Laws for Neural Machine Translation

When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

Observational Scaling Laws and the Predictability of Language Model Performance