Abstract: The increasing presence of AI-generated content on the internet raises a critical question: What happens when generative machine learning models are pretrained on web-scale datasets containing data created by earlier models? Some authors prophesy $\textit{model collapse}$ under a "$\textit{replace}$" scenario: a sequence of models, the first trained with real data and each later one trained only on synthetic data from its preceding model. In this scenario, models successively degrade. Others see collapse as easily avoidable; in an "$\textit{accumulate}$" scenario, a sequence of models is trained, but each training uses all real and synthetic data generated so far. In this work, we deepen and extend the study of these contrasting scenarios. First, we study collapse versus avoidance of collapse by comparing the $\textit{replace}$ and $\textit{accumulate}$ scenarios in each of three prominent generative modeling settings; we find that the same contrast emerges in all three settings. Second, we study a compromise scenario: the available data remains the same as in the $\textit{accumulate}$ scenario, but unlike $\textit{accumulate}$ and like $\textit{replace}$, each model is trained using a fixed compute budget. We demonstrate that model test loss on real data is larger than in the $\textit{accumulate}$ scenario, but apparently plateaus, unlike the divergence seen with $\textit{replace}$. Third, we study the relative importance of cardinality and proportion of real data for avoiding model collapse. Surprisingly, we find a non-trivial interaction between real and synthetic data, where the value of synthetic data for reducing test loss depends on the absolute quantity of real data. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
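
The difference between the $\textit{replace}$ and $\textit{accumulate}$ scenarios can be illustrated with a toy simulation. The following is a minimal sketch under strong simplifying assumptions, not the experimental setup of the paper: the "model" is a one-dimensional Gaussian fit by maximum likelihood, every generation draws the same number of samples, and the names `fit_gaussian` and `simulate` are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def fit_gaussian(xs):
    """Maximum-likelihood mean and standard deviation of a 1-D sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)


def simulate(scenario, n_per_gen=10, generations=200, seed=0):
    """Return the fitted standard deviation after each model-fitting iteration.

    scenario="replace":    each model is fit only to samples from its predecessor.
    scenario="accumulate": each model is fit to the real data plus all synthetic
                           data generated so far.
    """
    rng = random.Random(seed)
    # Toy stand-in for "real" data: draws from a standard normal (an assumption).
    pool = [rng.gauss(0.0, 1.0) for _ in range(n_per_gen)]
    mu, sigma = fit_gaussian(pool)
    history = [sigma]
    for _ in range(generations):
        # Sample synthetic data from the most recently fitted model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_per_gen)]
        if scenario == "replace":
            pool = synthetic
        elif scenario == "accumulate":
            pool = pool + synthetic
        else:
            raise ValueError(f"unknown scenario: {scenario!r}")
        mu, sigma = fit_gaussian(pool)
        history.append(sigma)
    return history


if __name__ == "__main__":
    for name in ("replace", "accumulate"):
        sigmas = simulate(name)
        print(f"{name:>10}: first sigma = {sigmas[0]:.3f}, last sigma = {sigmas[-1]:.3g}")
```

In this toy setting the fitted standard deviation under `"replace"` tends to shrink toward zero over many generations, while under `"accumulate"` it tends to stay near its initial value. This mirrors the qualitative contrast described in the abstract, but it says nothing about the compute-constrained compromise scenario or about the three generative modeling settings the paper studies.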

A Tale of Tails: Model Collapse as a Change of Scaling Laws

Strong Model Collapse

Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement

Model Collapse Demystified: The Case of Regression

Scaling Graph Neural Networks for Large-Scale Power Systems Analysis: Empirical Laws for Emergent Abilities

How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse

Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Scaling Laws for Neural Language Models

Observational Scaling Laws and the Predictability of Language Model Performance

A Solvable Model of Neural Scaling Laws

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

A Dynamical Model of Neural Scaling Laws

An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

AI models collapse when trained on recursively generated data

Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

Linguistic Collapse: Neural Collapse in (Large) Language Models

Scaling Laws Do Not Scale

Towards Neural Scaling Laws for Time Series Foundation Models

Explaining Neural Scaling Laws

Revisiting Neural Scaling Laws in Language and Vision