Auto-Regressive Next-Token Predictors are Universal Learners

Eran Malach
2024-07-30
Abstract: Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token predictors, trained on Chain-of-Thought (CoT) data, can approximate any function efficiently computed by a Turing machine. We introduce a new complexity measure, length complexity, which measures the number of intermediate tokens in a CoT sequence required to approximate some target function, and analyze the interplay between length complexity and other notions of complexity. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. Our results demonstrate that the power of today's LLMs can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper explores the capabilities of auto-regressive next-token predictors in logical and mathematical reasoning. Specifically, it addresses the following questions:

1. **Are auto-regressive next-token predictors merely advanced auto-completion models, or are they capable of true logical reasoning?**
   - Using Chain-of-Thought (CoT) data, the authors show that these models can do more than memorize large amounts of data: they can also carry out complex logical reasoning.
2. **Is the ability of these models mainly attributable to the auto-regressive training scheme rather than to specific architecture choices?**
   - Through theoretical analysis and experiments, the authors show that even simple linear models, trained auto-regressively on CoT data, can approximate any function efficiently computed by a Turing machine.
3. **How does auto-regressive learning differ from traditional supervised learning, and why can it learn complex functions more effectively?**
   - The authors point out that in auto-regressive learning the intermediate tokens themselves provide supervision, which greatly simplifies the learning task.

### Main contributions

1. **Theoretical framework**:
   - The authors propose a theoretical framework for studying auto-regressive next-token predictors and prove that simple models (such as linear models), trained on CoT data, can approximate any function efficiently computed by a Turing machine.
2. **Length complexity**:
   - A new complexity measure, length complexity, is introduced. It measures the number of intermediate tokens in a CoT sequence required to learn a given concept class. The authors analyze the relationship between length complexity and other complexity measures (such as sample complexity and running-time complexity).
3. **Experimental verification**:
   - Experiments show that simple models (such as linear networks and shallow multi-layer perceptrons) achieve non-trivial performance on text generation and arithmetic tasks. For example, a shallow multi-layer perceptron can correctly perform four-digit multiplication when given Chain-of-Thought data.

### Conclusion

The main conclusion of the paper is that the capabilities of modern large language models in logical reasoning can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture. Given Chain-of-Thought data, even simple models can efficiently solve complex tasks. This finding offers a new perspective on the potential of auto-regressive learning and lays a foundation for future theoretical and experimental research.
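The claim that intermediate tokens simplify the task can be illustrated with a toy example. The sketch below uses parity as the target, chosen here purely for illustration: no single linear threshold over the raw input bits computes parity for two or more bits, yet every token in a suitable chain of thought is a linear threshold of earlier tokens. The function names (`threshold`, `parity_with_cot`), the token layout, and the hand-picked weights are all assumptions made for this example. Nothing is learned here, and this is not presented as the paper's construction.

```python
def threshold(weights, bias, context):
    """Linear threshold unit: 1 if <weights, context> + bias >= 0, else 0."""
    return int(sum(w * c for w, c in zip(weights, context)) + bias >= 0)


def parity_with_cot(bits):
    """Emit a chain of thought whose last token is the parity of `bits`.

    Each emitted token is a linear threshold of tokens already in the
    context (hypothetical hand-set weights, one weight vector per position).
    Returns (cot_tokens, answer).
    """
    context = list(bits)   # the context starts as the input tokens
    n = len(bits)
    running = 0            # index of the current running parity (initially x_1)
    for i in range(1, n):
        # Token A: a = AND(p, x_i) = 1[p + x_i - 2 >= 0]
        w = [0] * len(context)
        w[running] += 1
        w[i] += 1
        context.append(threshold(w, -2, context))
        and_idx = len(context) - 1

        # Token B: p' = XOR(p, x_i) = 1[p + x_i - 2a - 1 >= 0]
        w = [0] * len(context)
        w[running] += 1
        w[i] += 1
        w[and_idx] = -2
        context.append(threshold(w, -1, context))
        running = len(context) - 1
    return context[n:], context[running]


cot, answer = parity_with_cot([1, 0, 1, 1])
print(cot, answer)  # [0, 1, 1, 0, 0, 1] 1
```

In this toy setup the chain uses 2(n - 1) intermediate tokens for n input bits, which gives a rough feel for what a length-complexity measure counts: how many intermediate tokens are needed before each next-token step becomes simple enough for the model class at hand.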