Abstract: Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token predictors, trained on Chain-of-Thought (CoT) data, can approximate any function efficiently computed by a Turing machine. We introduce a new complexity measure -- length complexity -- which measures the number of intermediate tokens in a CoT sequence required to approximate some target function, and analyze the interplay between length complexity and other notions of complexity. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. Our results demonstrate that the power of today's LLMs can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper explores the capabilities of auto-regressive next-token predictors in logical and mathematical reasoning. Specifically, it attempts to answer the following questions:

1. **Are auto-regressive next-token predictors merely advanced auto-completion models, or are they capable of true logical reasoning?**
   - The authors show that, when trained on Chain-of-Thought (CoT) data, these models can do more than memorize large amounts of data: they can carry out complex multi-step reasoning.
2. **Is the ability of these models mainly attributable to the auto-regressive training scheme rather than to specific architecture choices?**
   - Through theoretical analysis and experiments, the authors show that even simple linear next-token predictors, trained on CoT data, can approximate any function efficiently computed by a Turing machine.
3. **How does auto-regressive learning differ from traditional supervised learning, and why can it learn complex functions more effectively?**
   - The authors point out that in auto-regressive learning the intermediate tokens of the sequence also serve as supervision, which greatly simplifies the learning task.

### Main contributions

1. **Theoretical framework**:
   - The authors propose a theoretical framework for studying auto-regressive next-token predictors and prove that simple models (such as linear next-token predictors) trained on CoT data can approximate any function efficiently computed by a Turing machine.
2. **Length complexity**:
   - A new complexity measure, length complexity, is introduced. It measures the number of intermediate tokens in a CoT sequence required to approximate a target function. The authors analyze the relationship between length complexity and other complexity measures (such as sample complexity and running-time complexity).
3. **Experimental verification**:
   - Experiments show that simple models (such as linear networks and shallow multi-layer perceptrons) achieve non-trivial performance on text generation and arithmetic tasks. For example, a shallow multi-layer perceptron can correctly perform four-digit multiplication given Chain-of-Thought data.

### Conclusion

The main conclusion of the paper is that the capabilities of modern large language models in logical reasoning can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture. Given Chain-of-Thought data, even simple models can efficiently solve complex tasks. This finding provides a new perspective on the potential of auto-regressive learning and lays a foundation for future theoretical and experimental research.
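The role of intermediate-token supervision can be illustrated with a toy sketch. It is not the paper's construction or its experimental setup. It assumes a parity task in which the CoT tokens are the running parities of the input bits, and a hypothetical feature map (`features`) that encodes bits as ±1 and adds a product term. Under those assumptions, each next-token step is linearly separable, so a plain perceptron can learn it. All function names here are illustrative.

```python
import random


def cot_sequence(bits):
    """Chain-of-thought tokens for parity: the running parity after each bit.

    The last token is the answer; the earlier ones are intermediate tokens.
    """
    out, parity = [], 0
    for b in bits:
        parity ^= b
        out.append(parity)
    return out


def features(prev_token, bit):
    """Assumed feature map: bias, +/-1 encodings, and their product."""
    a, b = 1 - 2 * prev_token, 1 - 2 * bit
    return [1.0, a, b, a * b]


def train(n_bits=8, n_samples=200, epochs=20, seed=0):
    """Fit a linear next-token predictor with the perceptron rule.

    Every CoT token is a supervised target, not only the final answer.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        bits = [rng.randint(0, 1) for _ in range(n_bits)]
        prev = 0
        for bit, target in zip(bits, cot_sequence(bits)):
            data.append((features(prev, bit), 1 - 2 * target))
            prev = target  # teacher forcing: condition on the true token
    w = [0.0] * 4
    for _ in range(epochs):
        for f, y in data:
            score = sum(wi * fi for wi, fi in zip(w, f))
            if y * score <= 0:  # misclassified: perceptron update
                w = [wi + y * fi for wi, fi in zip(w, f)]
    return w


def predict_parity(w, bits):
    """Generate the CoT auto-regressively and return the final token."""
    prev = 0
    for bit in bits:
        score = sum(wi * fi for wi, fi in zip(w, features(prev, bit)))
        prev = 0 if score > 0 else 1  # feed the prediction back in
    return prev


if __name__ == "__main__":
    weights = train()
    print(predict_parity(weights, [1, 0, 1, 1]))
```

In this sketch the chain for an n-bit input contains n tokens, n - 1 of them intermediate, which is the quantity that length complexity counts. Predicting the parity directly from the raw bits, without intermediate tokens, would not be possible for a linear model over the same per-bit features.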

Auto-Regressive Next-Token Predictors are Universal Learners
