Abstract: LLMs are statistical models of language that learn through stochastic gradient descent with a next token prediction objective. This has prompted a popular view among AI modelers: LLMs are just next token predictors. While LLMs are engineered using next token prediction, and trained based on their success at this task, our view is that a reduction to "just next token predictors" sells LLMs short. Moreover, there are important explanations of LLM behavior and capabilities that are lost when we engage in this kind of reduction. To draw this out, we make an analogy with a once-prominent research program in biology that explained evolution and development from the gene's eye view.

What problem does this paper attempt to address?

The paper asks whether it is appropriate to treat large language models (LLMs) as mere "next word predictors". It argues that this simplified view may overlook the complexity and versatility of LLMs in practical applications.

### Main Points of the Paper:

1. **LLMs are more than just next word predictors**:
   - Although LLMs rely primarily on the next word prediction task during training, their actual functions go far beyond it.
   - LLMs can generate coherent sentences, paragraphs, and even entire articles, which goes beyond simple next word prediction.
2. **Limitations of the simplified view**:
   - Viewing LLMs merely as next word predictors ignores their various capabilities and behaviors in practical applications.
   - This simplified view cannot explain why certain LLMs perform exceptionally well on specific tasks, such as answering questions, providing suggestions, or telling jokes.
3. **Perspective on functional changes**:
   - The paper uses an analogy from biology, the "gene's eye view," to illustrate functional changes. For example, Play-Doh was originally designed as a wallpaper cleaner but was later repurposed as a toy.
   - Similarly, although LLMs are trained with the goal of next word prediction, their functions have expanded to more areas in practical applications.
4. **Increase and decrease in explanatory power**:
   - The paper also discusses how the simplified view falls short in explanatory power. For instance, focusing solely on gene-level evolution overlooks the functional organization of the entire genome.
   - Similarly, focusing only on the next word prediction function of LLMs ignores the complex associative networks they rely on to generate coherent text.

### Conclusion of the Paper:

- LLMs are not just next word predictors; through complex associative networks and multi-task training, they possess various advanced functions.
- The simplified view not only limits our understanding of LLMs but may also affect our assessment of their potential risks and ethical issues.
- Therefore, we need to adopt higher-level descriptions and explanatory methods to better understand the capabilities and behaviors of LLMs.

Overall, the paper aims to correct the view that reduces LLMs to "next word predictors" and emphasizes their complexity and versatility in practical applications.
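To make the "next token prediction" objective discussed above concrete, here is a minimal sketch in Python. It is not taken from the paper: it assumes a toy count-based bigram model as a stand-in for a neural network trained with stochastic gradient descent, and the names `train_bigram`, `next_token_loss`, and `generate` are hypothetical. It shows the two ingredients the summary refers to: a loss that scores each next-token guess, and a loop that reuses the predictor to produce longer text.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Estimate p(next | current) from counts.

    Assumption: a count-based table stands in for a trained LLM.
    """
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return {
        cur: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for cur, ctr in counts.items()
    }


def next_token_loss(model, tokens):
    """Average negative log-likelihood of each token given its predecessor."""
    pairs = list(zip(tokens, tokens[1:]))
    return -sum(math.log(model[cur][nxt]) for cur, nxt in pairs) / len(pairs)


def generate(model, start, max_new_tokens):
    """Greedy autoregressive decoding: repeatedly append the likeliest next token."""
    out = [start]
    for _ in range(max_new_tokens):
        dist = model.get(out[-1])
        if not dist:  # no observed continuation for this token
            break
        out.append(max(dist, key=dist.get))
    return out


corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
loss = next_token_loss(model, corpus)
sample = generate(model, "on", 2)
```

Even in this toy setting, the training signal only ever scores one next token at a time, while `generate` yields multi-token output. The paper's question is whether describing the system only at the level of that training signal captures everything worth explaining about it.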

LLMs are Not Just Next Token Predictors
