Abstract: Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. Most LLMs, however, use insufficient vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work highlights the importance of jointly considering tokenization and model scaling for efficient pre-training. The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo.
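As a rough illustration of the IsoFLOPs idea named in the abstract, the sketch below holds the compute budget fixed, varies the vocabulary size, fits a parabola to loss against log vocabulary size, and reads off the minimum. This is a generic version of the technique, not the authors' implementation. The helper names (`fit_parabola`, `optimal_vocab`), the loss values, and the 50,000 minimum are all hypothetical and synthetic, chosen only so the example runs.

```python
import math


def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c; returns (a, b, c)."""
    # Power sums for the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gaussian elimination with partial pivoting.
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, 3):
            factor = m[r][i] / m[i][i]
            for c in range(i, 4):
                m[r][c] -= factor * m[i][c]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (m[i][3] - tail) / m[i][i]
    return tuple(coef)


def optimal_vocab(vocab_sizes, losses):
    """Estimate the loss-minimizing vocabulary size at one fixed FLOPs budget."""
    xs = [math.log(v) for v in vocab_sizes]
    center = sum(xs) / len(xs)  # centering keeps the fit well conditioned
    a, b, _ = fit_parabola([x - center for x in xs], losses)
    if a <= 0:
        raise ValueError("fitted curve has no minimum")
    return math.exp(center - b / (2 * a))


# Synthetic losses (NOT from the paper): an exact parabola in log(V)
# whose minimum is placed at a made-up vocabulary size of 50,000.
TRUE_OPT = 50_000
vocab_sizes = [8_000, 16_000, 32_000, 64_000, 128_000]
losses = [2.5 + 0.05 * (math.log(v) - math.log(TRUE_OPT)) ** 2 for v in vocab_sizes]

v_opt = optimal_vocab(vocab_sizes, losses)
print(round(v_opt))
```

Repeating this fit at several FLOPs budgets would give one estimated optimum per budget, which is the kind of relationship the abstract describes between compute and vocabulary size.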

An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models

An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

Scaling Laws for Neural Language Models

Observational Scaling Laws and the Predictability of Language Model Performance

Language models scale reliably with over-training and on downstream tasks

A Theory for Emergence of Complex Skills in Language Models

A Solvable Model of Neural Scaling Laws

A Mathematical Theory for Learning Semantic Languages by Abstract Learners

The Quantization Model of Neural Scaling

Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Evaluating Computational Language Models with Scaling Properties of Natural Language

The Information of Large Language Model Geometry

Inverse Scaling: When Bigger Isn't Better

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Is the Number of Trainable Parameters All That Actually Matters?

Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling

Scaling Laws for Multilingual Language Models

U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models