LLMs are Not Just Next Token Predictors

Stephen M. Downes, Patrick Forber, Alex Grzankowski
2024-08-07
Abstract: LLMs are statistical models of language learning through stochastic gradient descent with a next token prediction objective. This prompts a popular view among AI modelers: LLMs are just next token predictors. While LLMs are engineered using next token prediction, and trained based on their success at this task, our view is that a reduction to just next token predictor sells LLMs short. Moreover, there are important explanations of LLM behavior and capabilities that are lost when we engage in this kind of reduction. In order to draw this out, we will make an analogy with a once prominent research program in biology explaining evolution and development from the gene's eye view.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks whether it is appropriate to treat large language models (LLMs) as mere "next token predictors." It argues that this simplified view overlooks the complexity and versatility that LLMs show in practical applications.

### Main Points of the Paper:

1. **LLMs are more than just next token predictors**:
   - Although LLMs are trained primarily on a next token prediction task, what they actually do goes well beyond it.
   - LLMs can generate coherent sentences, paragraphs, and even entire articles, which a description in terms of predicting one token at a time does not capture.
2. **Limitations of the simplified view**:
   - Viewing LLMs merely as next token predictors ignores the range of capabilities and behaviors they show in practical applications.
   - The simplified view cannot explain why certain LLMs perform exceptionally well on specific tasks, such as answering questions, providing suggestions, or telling jokes.
3. **Perspective on functional changes**:
   - The paper draws an analogy with the "gene's eye view" in biology and discusses how the function of a thing can change over time. For example, Play-Doh was originally designed as a wallpaper cleaner but was later repurposed as a toy.
   - Similarly, although LLMs are trained with the goal of next token prediction, their functions have expanded to other areas in practical applications.
4. **Increase and decrease in explanatory power**:
   - The paper also discusses what the simplified view loses in explanatory power. For instance, focusing solely on gene-level evolution overlooks the functional organization of the entire genome.
   - Similarly, focusing only on the next token prediction function of LLMs ignores the complex associative networks they rely on to generate coherent text.
### Conclusion of the Paper:

- LLMs are not just next word predictors; through complex associative networks and multi-task training, they possess various advanced functions.
- The simplified view not only limits our understanding of LLMs but may also affect our assessment of their potential risks and ethical issues.
- Therefore, we need to adopt higher-level descriptions and explanatory methods to better understand the capabilities and behaviors of LLMs.

Overall, the paper aims to correct the view that reduces LLMs to "next word predictors" and emphasizes their complexity and versatility in practical applications.
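
For readers unfamiliar with the training objective that the abstract refers to, the following is a minimal sketch of "stochastic gradient descent with a next token prediction objective." It is not taken from the paper and it is not an LLM: it assumes a toy bigram model in which the logits for the next token depend only on the previous token. The names `train_bigram`, `softmax`, and `predict_next` are hypothetical and chosen for illustration.

```python
import math
import random


def softmax(row):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(tokens, vocab, steps=2000, lr=0.5, seed=0):
    """Toy next-token predictor (an assumption for illustration, not an LLM).

    logits[prev][nxt] is fitted by SGD on the cross-entropy loss
    -log p(nxt | prev), one randomly sampled (prev, nxt) pair per step.
    """
    rng = random.Random(seed)
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(vocab)
    logits = [[0.0] * n for _ in range(n)]
    pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]
    for _ in range(steps):
        prev, nxt = rng.choice(pairs)
        probs = softmax(logits[prev])
        # Gradient of -log p(nxt | prev) w.r.t. the logits is probs - one_hot(nxt).
        for j in range(n):
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            logits[prev][j] -= lr * grad
    return logits, index


def predict_next(logits, index, vocab, prev):
    """Return the most probable next token after `prev`."""
    row = logits[index[prev]]
    best = max(range(len(row)), key=row.__getitem__)
    return vocab[best]


if __name__ == "__main__":
    vocab = ["a", "b", "c"]
    corpus = "a b c a b c a b c".split()
    logits, index = train_bigram(corpus, vocab)
    for tok in vocab:
        print(tok, "->", predict_next(logits, index, vocab, tok))
```

The sketch shows only what the objective optimizes. Whether describing a trained model at this level is enough to explain its behavior is the question the paper takes up.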