Performance Law of Large Language Models

Chuhan Wu, Ruiming Tang
2024-09-13
Abstract: Guided by the belief of the scaling law, large language models (LLMs) have achieved impressive performance in recent years. However, scaling law only gives a qualitative estimation of loss, which is influenced by various factors such as model architectures, data distributions, tokenizers, and computation precision. Thus, estimating the real performance of LLMs with different training settings rather than loss may be quite useful in practical development. In this article, we present an empirical equation named "Performance Law" to directly predict the MMLU score of an LLM, which is a widely used metric to indicate the general capability of LLMs in real-world conversations and applications. Based on only a few key hyperparameters of the LLM architecture and the size of training data, we obtain a quite accurate MMLU prediction of various LLMs with diverse sizes and architectures developed by different organizations in different years. Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to accurately predict the performance of large language models (LLMs) under different training settings. Specifically, the authors propose an empirical formula, the "Performance Law", that directly predicts an LLM's MMLU score, a widely used metric that reflects the general capability of LLMs in real conversations and applications. Using only a few key hyperparameters (such as the number of model layers, the hidden size, and the intermediate size of the feed-forward network) and the amount of training data, the formula gives relatively accurate predictions for LLMs of different scales and architectures.

### Main Problems and Solutions

1. **Limitations of Existing Methods**:
   - Existing scaling laws mainly focus on training loss, but the training loss of different models is affected by multiple factors, such as model architecture, data distribution, tokenizer, and computational precision. Loss is therefore difficult to use directly for performance prediction.
   - Existing performance prediction methods generalize poorly across model structures (such as dense or sparse) and shapes (such as wide or deep).

2. **Proposed New Method**:
   - **Performance Law**: By introducing the model instability discount \( u \) and the model saturation clipping \( T' \), a log-linear regression function is constructed to predict the MMLU score:

     \[ \text{MMLU} = w_1 \log(uN) + w_2 \log(uh) + w_3 \log(ud) + w_4 \log(uT') + b \]

     where \( w_1, w_2, w_3, w_4 \) and \( b \) are regression parameters, and \( u \) is the model instability discount, defined as:

     \[ u = e^{-\left[\left(\frac{10}{d}+\frac{20}{h}\right)\gamma N\right]^2} \]

     \( T' \) is the effective number of training tokens, defined as:

     \[ T' = \min(T, S) \]

3. **Extension to Mixture-of-Experts (MoE) Models**:
   - For MoE models, the number of activated parameters \( A \) is taken into account, and an expansion factor \( g \) is introduced to adjust the performance prediction:

     \[ g = \left(\sqrt{\frac{A}{S}}\cdot A\right)^{\frac{1}{3}\cdot 0.5} + \sqrt{\frac{A}{S}}\cdot\frac{1}{1 + e^{-\frac{A}{4}}} \]

     The final MMLU prediction formula for MoE models is:

     \[ \text{MMLU} = w_1 \log(u' N g) + w_2 \log(u' h g) + w_3 \log(u' d) + w_4 \log(u' T) + b \]

     where \( u' \) is the modified model instability discount, defined as:

     \[ u' = e^{-\left[\left(\frac{10}{d'}+\frac{20}{h}\right)N\right]^2} \]

### Applications and Implications

1. **Predicting the Scaling Potential of LLMs**:
   - The law can be used to predict the potential performance of next-generation LLMs, for example a giant MoE model with 125 trillion parameters.

2. **Designing Appropriate Model Architectures**:
   - It helps developers choose a model architecture that fits a given training cost budget and inference efficiency requirement.

3. **Tracking the Health State of Models**:
   - Comparing predicted and actual values helps developers detect abnormal situations in model training at an early stage.

4. **Planning the Scaling of Dense Models**:
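
As a rough illustration of the dense-model Performance Law stated under "Proposed New Method", the sketch below evaluates the instability discount \( u \), the clipped token count \( T' = \min(T, S) \), and the log-linear prediction. It rests on several assumptions that this summary does not confirm: \( N \), \( h \), and \( d \) are taken to be the number of layers, the hidden size, and the feed-forward intermediate size (following the order in which the hyperparameters are listed above); \( \log \) is taken to be the natural logarithm; \( T \) and \( S \) are passed in whatever units the formula intends; and the function names are hypothetical. The regression parameters \( w_1, \dots, w_4, b \) and the coefficient \( \gamma \) are not given here, so the values in the example call are placeholders only, not fitted values.

```python
import math


def instability_discount(n_layers, hidden, ffn, gamma=1.0):
    """u = exp(-[(10/d + 20/h) * gamma * N]^2), as written in the summary.

    gamma defaults to 1.0 purely as a placeholder (assumption).
    """
    x = (10.0 / ffn + 20.0 / hidden) * gamma * n_layers
    return math.exp(-x * x)


def effective_tokens(tokens, size):
    """Saturation clipping: T' = min(T, S)."""
    return min(tokens, size)


def predict_mmlu(n_layers, hidden, ffn, tokens, size, weights, bias, gamma=1.0):
    """Log-linear form: w1*log(uN) + w2*log(uh) + w3*log(ud) + w4*log(uT') + b.

    Natural log is an assumption; the summary does not state the base.
    """
    u = instability_discount(n_layers, hidden, ffn, gamma)
    t_eff = effective_tokens(tokens, size)
    w1, w2, w3, w4 = weights
    return (
        w1 * math.log(u * n_layers)
        + w2 * math.log(u * hidden)
        + w3 * math.log(u * ffn)
        + w4 * math.log(u * t_eff)
        + bias
    )


# Example call with PLACEHOLDER weights and bias (hypothetical, not fitted).
example = predict_mmlu(
    n_layers=32, hidden=4096, ffn=11008,
    tokens=2.0, size=7.0,
    weights=(1.0, 1.0, 1.0, 1.0), bias=0.0,
)
print(example)
```

Because \( u \) multiplies every argument of the logarithm, a very deep and narrow configuration drives \( u \) toward zero and every term toward large negative values, which is how the discount penalizes unstable shapes in this form. In floating point, an extreme configuration can underflow \( u \) to exactly zero, in which case `math.log` raises an error.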