Abstract: Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this paper, we introduce a precise and quantitative law that governs the learning of contextualized token embeddings through intermediate layers in pre-trained LLMs for next-token prediction. Our findings reveal that each layer contributes equally to enhancing prediction accuracy, from the lowest to the highest layer -- a universal phenomenon observed across a diverse array of open-source LLMs, built on architectures such as Transformer, RWKV, and Mamba. We demonstrate that this law offers new perspectives and insights to inform and guide practices in LLM development and applications, including model scaling, pre-training tasks, and information flow. Overall, our law enables more fine-grained approaches to the design, training, and interpretation of LLMs through scrutinizing their internal data processing mechanisms.

What problem does this paper attempt to address?

The paper aims to explain how large language models (LLMs) process input data internally when predicting the next token. Although LLMs are widely used across application domains, their black-box nature makes their internal data-processing mechanisms difficult to understand and explain. Specifically, the paper seeks to characterize how pre-trained LLMs learn contextualized token embeddings through their intermediate layers. The authors introduce a precise, quantitative law, the law of equi-learning, to describe this process. The law states that, from the lowest layer to the highest, each layer contributes approximately equally to improving prediction accuracy, and that this phenomenon is prevalent across a variety of open-source LLMs, including models based on the Transformer, RWKV, and Mamba architectures.

### Core content of the law of equi-learning

According to the law of equi-learning, the ability of an LLM to predict the next token improves exponentially with depth: each layer reduces the prediction residual by approximately the same multiplicative factor. The formula is:

\[ PR_l \approx \rho^{l - 1} \times PR_1 \]

where:

- \( PR_l \) is the prediction residual of the \( l \)-th layer.
- \( \rho \) is a decay ratio satisfying \( 0 < \rho < 1 \).
- \( PR_1 \) is the prediction residual of the first layer.

The prediction residual \( PR \) is defined as:

\[ PR = \frac{\sum (x_{\text{next}} - \hat{x}_{\text{next}})^2}{\sum (x_{\text{next}} - \bar{x}_{\text{next}})^2} \]

where:

- \( x_{\text{next}} \) is the actual next token.
- \( \hat{x}_{\text{next}} = w \cdot h + b \) is the predicted next token, a linear prediction from the layer's token embedding \( h \) with weights \( w \) and bias \( b \).
- \( \bar{x}_{\text{next}} \) is the average of all \( x_{\text{next}} \).

### Main findings of the paper

1. **Equal contribution of each layer**: Each layer contributes approximately equally to the improvement in prediction ability, which challenges the previous view that some layers are more important than others.
2. **Universality**: The law has been observed in LLMs of various architectures and scales, so it applies widely.
3. **Training dynamics**: The paper also examines how training steps, training epochs, and data repetition affect the law of equi-learning, and finds that a sufficient total number of training tokens helps the law emerge.
4. **Model scaling**: As model scale increases, the law of equi-learning gives a more fine-grained picture than the traditional approach of relying on test loss alone.

### Practical applications

The law offers a new perspective and guidance for LLM development and applications, including model scaling, the choice of pre-training tasks, and information flow. For example, it can help optimize the training process and improve model transparency and interpretability, and thus help realize the potential of LLMs more fully.

In conclusion, by introducing the law of equi-learning, the paper reveals a regularity in the internal workings of LLMs and provides a theoretical basis for a deeper understanding of these complex models.
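To make the two formulas in the summary concrete, here is a minimal Python sketch, not the authors' code. It assumes the prediction residual is computed from an ordinary least-squares fit of the next-token target on a layer's embeddings, and that the decay ratio \( \rho \) can be estimated by a straight-line fit to \( \log PR_l \) against layer index. The function names `prediction_residual` and `fit_decay_ratio` are hypothetical, the data are synthetic, and the numeric encoding of the next token is left to the caller because the summary does not specify it.

```python
import numpy as np


def prediction_residual(h, x_next):
    """Return PR = residual sum of squares / total sum of squares.

    h: (n_tokens, dim) embeddings from one layer.
    x_next: (n_tokens,) or (n_tokens, k) numeric encoding of the next token
            (the encoding itself is an assumption of this sketch).
    """
    h = np.asarray(h, dtype=float)
    x_next = np.asarray(x_next, dtype=float)
    # Append a column of ones so the fit includes the bias term b.
    design = np.column_stack([h, np.ones(len(h))])
    coef, *_ = np.linalg.lstsq(design, x_next, rcond=None)
    pred = design @ coef
    rss = np.sum((x_next - pred) ** 2)
    tss = np.sum((x_next - x_next.mean(axis=0)) ** 2)
    return rss / tss


def fit_decay_ratio(pr_per_layer):
    """Fit log PR_l = log PR_1 + (l - 1) * log rho; return (rho, PR_1)."""
    pr = np.asarray(pr_per_layer, dtype=float)
    depth = np.arange(len(pr))  # equals l - 1 for layers l = 1, 2, ...
    slope, intercept = np.polyfit(depth, np.log(pr), 1)
    return float(np.exp(slope)), float(np.exp(intercept))


# Synthetic illustration: a target that is exactly linear in the embeddings
# gives a PR near 0, while an unrelated target gives a PR near 1.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 8))
linear_target = embeddings @ rng.normal(size=8) + 1.5
unrelated_target = rng.normal(size=500)
print("PR, linear target:   ", prediction_residual(embeddings, linear_target))
print("PR, unrelated target:", prediction_residual(embeddings, unrelated_target))

# A PR sequence that follows the law exactly, with rho = 0.8 and PR_1 = 0.9.
synthetic_pr = 0.9 * 0.8 ** np.arange(12)
rho, pr_1 = fit_decay_ratio(synthetic_pr)
print("estimated rho:", rho, "estimated PR_1:", pr_1)
```

With a real model, one would compute `prediction_residual` once per layer on that layer's embeddings and pass the resulting sequence to `fit_decay_ratio`. How close the points lie to the fitted line in log space indicates how well the layers follow the law.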

A Law of Next-Token Prediction in Large Language Models
