Abstract: Guided by the belief of the scaling law, large language models (LLMs) have achieved impressive performance in recent years. However, scaling law only gives a qualitative estimation of loss, which is influenced by various factors such as model architectures, data distributions, tokenizers, and computation precision. Thus, estimating the real performance of LLMs with different training settings rather than loss may be quite useful in practical development. In this article, we present an empirical equation named "Performance Law" to directly predict the MMLU score of an LLM, which is a widely used metric to indicate the general capability of LLMs in real-world conversations and applications. Based on only a few key hyperparameters of the LLM architecture and the size of training data, we obtain a quite accurate MMLU prediction of various LLMs with diverse sizes and architectures developed by different organizations in different years. Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to accurately predict the performance of large language models (LLMs) under different training settings. Specifically, the authors propose an empirical formula, the "Performance Law", for directly predicting the MMLU scores of LLMs. MMLU is a widely used metric that reflects the general capability of LLMs in real conversations and applications. Using only a small number of key hyperparameters (such as the number of model layers, the hidden size, and the intermediate size of the feed-forward network) and the amount of training data, the formula gives relatively accurate predictions for LLMs of different scales and architectures.

### Main Problems and Solutions

1. **Limitations of Existing Methods**:
   - Existing scaling laws mainly focus on training loss, but the training loss of different models is affected by multiple factors, such as model architecture, data distribution, tokenizer, and computational precision. Loss is therefore difficult to use directly for performance prediction.
   - Existing performance prediction methods generalize poorly across model structures (such as dense or sparse) and shapes (such as wide or deep).

2. **Proposed New Method**:
   - **Performance Law**: By introducing the model instability discount \( u \) and the model saturation clipping \( T' \), a log-linear regression function is constructed to predict the MMLU score:

     \[ \text{MMLU} = w_1 \log(uN) + w_2 \log(uh) + w_3 \log(ud) + w_4 \log(uT') + b \]

     where \( w_1, w_2, w_3, w_4 \) and \( b \) are regression parameters, and \( u \) is the model instability discount, defined as:

     \[ u = e^{-\left[\left(\frac{10}{d}+\frac{20}{h}\right)\gamma N\right]^2} \]

     \( T' \) is the effective number of training tokens, defined as:

     \[ T' = \min(T, S) \]

3. **Extension to Mixture-of-Experts (MoE) Models**:
   - For MoE models, the number of activated parameters \( A \) is taken into account, and an expansion factor \( g \) is introduced to adjust the performance prediction:

     \[ g = \left(\sqrt{\frac{A}{S}} \cdot A\right)^{\frac{1}{3} \cdot 0.5} + \sqrt{\frac{A}{S}} \cdot \frac{1}{1 + e^{-\frac{A}{4}}} \]

     The final MMLU prediction formula for MoE models is:

     \[ \text{MMLU} = w_1 \log(u' N g) + w_2 \log(u' h g) + w_3 \log(u' d) + w_4 \log(u' T) + b \]

     where \( u' \) is the modified model instability discount, defined as:

     \[ u' = e^{-\left[\left(\frac{10}{d'}+\frac{20}{h}\right)N\right]^2} \]

### Applications and Implications

1. **Predicting the Scaling Potential of LLMs**:
   - The law can be used to predict the potential performance of next-generation LLMs, for example a giant MoE model with 125 trillion parameters.

2. **Designing Appropriate Model Architectures**:
   - It helps developers choose a model architecture under a given training cost budget and inference efficiency requirement.

3. **Tracking the Health State of Models**:
   - Comparing predicted and actual values helps developers detect abnormal situations in model training at an early stage.

4. **Planning the Scaling of Dense Models**:
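The dense-model formulas in the summary above can be written out as a short Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical; \( N \), \( h \), and \( d \) are read as the number of layers, the hidden size, and the feed-forward intermediate size (following the order in which the summary lists them); \( \gamma \) and \( S \) are not defined in this summary and are passed in as plain inputs; `log` is assumed to be the natural logarithm; and the regression weights are placeholders that would have to come from a fit, since the summary gives no values.

```python
import math


def instability_discount(n_layers, hidden, ffn, gamma):
    """u = exp(-[(10/d + 20/h) * gamma * N]^2), as stated in the summary.

    Assumption: N = n_layers, h = hidden, d = ffn. The units of h and d
    and the meaning of gamma are not given here; callers must supply them.
    """
    x = (10.0 / ffn + 20.0 / hidden) * gamma * n_layers
    return math.exp(-(x * x))


def effective_tokens(tokens, saturation):
    """T' = min(T, S): training tokens clipped at the saturation value S."""
    return min(tokens, saturation)


def predict_mmlu_dense(n_layers, hidden, ffn, tokens, saturation,
                       gamma, weights, bias):
    """Log-linear form: w1*log(uN) + w2*log(uh) + w3*log(ud) + w4*log(uT') + b.

    `weights` is a 4-tuple (w1, w2, w3, w4) and `bias` is b. Both are
    hypothetical placeholders; natural log is an assumption.
    """
    u = instability_discount(n_layers, hidden, ffn, gamma)
    t_eff = effective_tokens(tokens, saturation)
    w1, w2, w3, w4 = weights
    return (w1 * math.log(u * n_layers)
            + w2 * math.log(u * hidden)
            + w3 * math.log(u * ffn)
            + w4 * math.log(u * t_eff)
            + bias)
```

Because \( u \) lies in \( (0, 1] \) and multiplies every argument of the logarithms, a deeper or narrower model (larger \( N \), smaller \( h \) or \( d \)) lowers all four terms at once, which is how the discount penalizes unstable shapes in this form. The MoE variant with \( g \) and \( u' \) is left out of the sketch because \( d' \) is not defined in the summary.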

Performance Law of Large Language Models
