Abstract: Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. Most LLMs, however, use insufficient vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work highlights the importance of jointly considering tokenization and model scaling for efficient pre-training. The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo.
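The claim that the optimal vocabulary size depends on the compute budget can be pictured as a power law of the form `optimum ≈ k * C**gamma`, which becomes a straight line in log-log space. The snippet below is a minimal sketch of such a fit, not the authors' implementation: the function name `fit_power_law`, the budgets, and the values `true_k` and `true_gamma` are all hypothetical, and the data points are synthetic rather than results from the paper.

```python
import numpy as np


def fit_power_law(compute, optimum):
    """Fit optimum ~= k * compute**gamma by least squares in log-log space."""
    # A power law is linear after taking logs: log(opt) = gamma*log(C) + log(k).
    gamma, log_k = np.polyfit(np.log(compute), np.log(optimum), deg=1)
    return float(np.exp(log_k)), float(gamma)


# Synthetic, illustrative points only (NOT measurements from the paper).
budgets = np.array([1e18, 1e19, 1e20, 1e21])   # hypothetical FLOPs budgets
true_k, true_gamma = 0.2, 0.4                  # hypothetical ground truth
vocab_params = true_k * budgets ** true_gamma  # noiseless synthetic optima

k, gamma = fit_power_law(budgets, vocab_params)
print(f"recovered k={k:.3f}, gamma={gamma:.3f}")
```

With real measurements, each point would be the best-performing configuration found at a fixed FLOPs budget, and the fitted exponent would say how quickly that optimum grows with compute.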

What problem does this paper attempt to address?

This paper addresses a gap in scaling-law research for large language models (LLMs): the influence of vocabulary size is overlooked when models are scaled up. Existing research focuses mainly on the number of model parameters and the amount of training data, ignoring the impact of vocabulary size on model performance. Through a series of experiments and analyses, the authors explore how to determine the optimal vocabulary size under a given compute budget and propose three methods for predicting it.

### Main research questions:

1. **The influence of vocabulary size on scaling laws**: Study how vocabulary size affects the scaling laws of large language models.
2. **Prediction of the optimal vocabulary size**: Propose three methods to predict the optimal vocabulary size under a given compute budget.
3. **Whether the vocabulary configuration of existing models is reasonable**: Evaluate whether the vocabulary configurations of currently popular large language models are reasonable and suggest improvements.

### Research background:

- **Scaling laws**: Existing research on scaling laws focuses mainly on the number of floating-point operations (FLOPs), the number of model parameters, and the amount of training data, but ignores the influence of vocabulary size.
- **Variability of vocabulary size**: Vocabulary sizes vary greatly between models. For example, Llama2-7B uses a 32K vocabulary while Gemma-7B uses a 256K vocabulary, although their total parameter counts are similar.

### Research methods:

1. **Normalized loss function**: To compare models with different vocabulary sizes fairly, a normalized loss function is introduced.
2. **Three prediction methods**:
   - **Method 1 (Power-law fitting based on IsoFLOPs)**: Train models with different vocabulary configurations and fit power-law relationships between the compute budget and each of the non-vocabulary parameters, the vocabulary parameters, and the training data.
   - **Method 2 (Derivative-based estimation)**: Take the derivative of FLOPs with respect to vocabulary size and solve for the vocabulary size that minimizes FLOPs.
   - **Method 3 (Parametric fit of the loss function)**: Modify the Chinchilla scaling law so that the normalized loss is predicted from the non-vocabulary parameters, the vocabulary parameters, and the number of training characters.

### Experimental results:

- **Prediction of the optimal vocabulary size**: The three methods give consistent predictions: the optimal vocabulary size depends on the compute budget, and it should grow as the model scale increases.
- **Deficiencies of existing models**: Most existing large language models use insufficient vocabulary sizes. For example, the predicted optimal vocabulary size for Llama2-70B is at least 216K, rather than its actual 32K.
- **Empirical verification**: In experiments with 3B-parameter models, using the predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes.

### Conclusions:

- **The importance of vocabulary size**: Vocabulary size has an important impact on the performance of large language models and should not be ignored.
- **Joint consideration of vocabulary, model parameters, and training data**: Efficient pre-training requires choosing vocabulary size, model parameters, and training data together.
- **Improvement of existing models**: Existing large language models are advised to increase their vocabulary sizes to improve performance.
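The trade-off behind a parametric loss fit can be illustrated with a toy model. The sketch below assumes a Chinchilla-style additive form with separate terms for non-vocabulary parameters, vocabulary parameters, and training data, and it ties data to compute through the common simplification `flops = 6 * (N_nv + N_v) * D`. Every coefficient, the function names, and the decision to hold `N_nv` fixed are hypothetical choices for illustration, not fitted values or the authors' procedure.

```python
import numpy as np

# Hypothetical coefficients, chosen only to make the toy example readable.
E, A1, A2, B = 0.5, 1e3, 1e3, 1e5
ALPHA1, ALPHA2, BETA = 0.5, 0.5, 0.5


def normalized_loss(n_nv, n_v, d):
    """Assumed additive form: -E + A1/N_nv^a1 + A2/N_v^a2 + B/D^b."""
    return -E + A1 / n_nv**ALPHA1 + A2 / n_v**ALPHA2 + B / d**BETA


def best_vocab_params(flops, n_nv, grid=None):
    """Grid-search the vocabulary parameter count with the lowest toy loss.

    The budget fixes the data amount via flops = 6 * (n_nv + n_v) * d,
    so spending more parameters on vocabulary leaves less training data.
    """
    if grid is None:
        grid = np.logspace(6, 10, 401)  # candidate vocabulary parameter counts
    d = flops / (6.0 * (n_nv + grid))
    losses = normalized_loss(n_nv, grid, d)
    return float(grid[np.argmin(losses)])


N_NV = 1e9  # hypothetical non-vocabulary parameter count, held fixed
small = best_vocab_params(1.2e20, N_NV)
large = best_vocab_params(1.2e21, N_NV)
print(f"toy optimum at 1.2e20 FLOPs: {small:.3g} vocabulary parameters")
print(f"toy optimum at 1.2e21 FLOPs: {large:.3g} vocabulary parameters")
```

In this toy setting the loss has an interior minimum, because extra vocabulary parameters lower one term while the reduced data raises another, and the minimizer moves to a larger value when the budget grows. That matches the qualitative conclusion summarized above, though the numbers themselves carry no meaning.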

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Language models scale reliably with over-training and on downstream tasks

Scaling Laws for Multilingual Language Models

Scaling Law for Language Models Training Considering Batch Size

Large Vocabulary Size Improves Large Language Models

Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Temporal Scaling Law for Large Language Models

Scaling Laws for Linear Complexity Language Models

Observational Scaling Laws and the Predictability of Language Model Performance

Scaling Laws for Neural Language Models

When Do We Not Need Larger Vision Models?

Inverse Scaling: When Bigger Isn't Better

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

Revisiting Neural Scaling Laws in Language and Vision

Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

Scaling Parameter-Constrained Language Models with Quality Data

Scaling-laws for Large Time-series Models