Mechanics of Next Token Prediction with Self-Attention

Yingcong Li, Yixiao Huang, M. Emrullah Ildiz, Ankit Singh Rawat, Samet Oymak

2024-03-13

Abstract: Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: (1) Hard retrieval: Given input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token. (2) Soft composition: It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to hard retrieval and soft composition steps respectively. This also formalizes a related implicit bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures.

Machine Learning, Artificial Intelligence, Computation and Language, Optimization and Control

What problem does this paper attempt to address?

This paper investigates what a single self-attention layer in a Transformer learns when it is trained on next-token prediction. It finds that, under gradient descent, the layer learns a two-step mechanism: hard retrieval (precisely selecting the high-priority input tokens associated with the last input token) and soft composition (forming a convex combination of these high-priority tokens, from which the next token can be sampled). The paper characterizes this mechanism through a directed graph over tokens extracted from the training data. It proves that gradient descent implicitly discovers the strongly connected components (SCCs) of this graph, and that self-attention learns to retrieve the tokens belonging to the highest-priority SCC available in the context window. These findings help explain how self-attention processes sequential data and lay a foundation for analyzing more complex architectures.
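The two-step mechanism can be illustrated with a toy calculation. The sketch below is not the paper's model or proof setup. It assumes each context token gets a scalar attention score that splits into a hypothetical `directional` part multiplied by a growing `scale` and a fixed `finite` part, mirroring the directional/finite decomposition mentioned in the abstract. All names and numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(directional, finite, scale):
    """Softmax over per-token scores of the form scale * directional + finite.

    Assumption for illustration: one scalar score per context token,
    rather than a full query-key product.
    """
    scores = [scale * d + f for d, f in zip(directional, finite)]
    return softmax(scores)


# Hypothetical context of four tokens.
# Tokens 0 and 1 share the largest directional score ("high priority").
directional = [1.0, 1.0, 0.0, -1.0]
# The finite part only matters among the tokens that survive retrieval.
finite = [math.log(3.0), 0.0, 0.5, 2.0]

for scale in (0.0, 1.0, 5.0, 50.0):
    weights = attention_weights(directional, finite, scale)
    print(scale, [round(w, 3) for w in weights])
```

As `scale` grows, the weight on tokens 2 and 3 vanishes (hard retrieval), while the split between tokens 0 and 1 settles at the ratio set by the finite part, here 3:1 (soft composition).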

Next-token prediction capacity: general upper bounds and a lower bound for transformers

Towards Understanding the Universality of Transformers for Next-Token Prediction

The pitfalls of next-token prediction

Non-asymptotic Convergence of Training Transformers for Next-token Prediction

From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

A Law of Next-Token Prediction in Large Language Models

Beyond Intuition: Rethinking Token Attributions Inside Transformers

Easy attention: A simple attention mechanism for temporal predictions with transformers

Predictive Attention Transformer: Improving Transformer with Attention Map Prediction

Autoregressive Modeling with Lookahead Attention

Emu3: Next-Token Prediction is All You Need

Low-Rank and Locality Constrained Self-Attention for Sequence Modeling

Auto-Regressive Next-Token Predictors are Universal Learners

Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Long-range Sequence Modeling with Predictable Sparse Attention

Attention Please: What Transformer Models Really Learn for Process Prediction

TCSA-Net: A Temporal-Context-Based Self-Attention Network for Next Location Prediction

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Attention as an RNN