Abstract: In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs). Yet the mechanisms that underlie it are poorly understood. In this paper, we demonstrate that comparable ICL capabilities can be acquired by an alternative sequence prediction learning method using clone-structured causal graphs (CSCGs). Moreover, a key property of CSCGs is that, unlike transformer-based LLMs, they are *interpretable*, which considerably simplifies the task of explaining how ICL works. Specifically, we show that it uses a combination of (a) learning template (schema) circuits for pattern completion, (b) retrieving relevant templates in a context-sensitive manner, and (c) rebinding of novel tokens to appropriate slots in the templates. We go on to marshal evidence for the hypothesis that similar mechanisms underlie ICL in LLMs. For example, we find that, with CSCGs as with LLMs, different capabilities emerge at different levels of overparameterization, suggesting that overparameterization helps in learning more complex template (schema) circuits. By showing how ICL can be achieved with small models and datasets, we open up a path to novel architectures, and take a vital step towards a more general understanding of the mechanics behind this important capability.

What problem does this paper attempt to address?

This paper seeks to explain the mechanism behind in-context learning (ICL) in large language models (LLMs). ICL is the ability of a pre-trained language model to pick up a new task from a small number of examples at inference time, even though the model was trained only to predict the next word. This ability lets LLMs handle a much wider range of applications, but the mechanism behind it is not yet fully understood.

To study that mechanism, the paper introduces an alternative sequence-prediction learning method, clone-structured causal graphs (CSCGs). Unlike transformer-based LLMs, CSCGs are interpretable, which simplifies the task of explaining how ICL works. The authors show that CSCGs achieve ICL by combining three mechanisms:

1. **Learning schema circuits**: template (schema) circuits are learned and used for pattern completion.
2. **Retrieving relevant schemas**: the schemas that match the current context are retrieved in a context-sensitive manner.
3. **Rebinding new tokens**: novel tokens are bound to the appropriate slots in the retrieved schema.

The paper also presents evidence for the hypothesis that similar mechanisms underlie ICL in transformer-based LLMs. For example, in both CSCGs and LLMs, different capabilities emerge at different levels of overparameterization, which suggests that overparameterization helps in learning more complex schema circuits. By showing how ICL can be achieved with small models and datasets, the paper opens a path to new architectures and takes a step towards a more general understanding of the mechanics behind this capability.

### Formula Summary

The formulas in the paper mainly describe the probability model and the update algorithm of CSCGs. Several key formulas follow.

- **Probability distribution of the observed sequence**, given the actions \(a_1, \dots, a_{N - 1}\) and summed over the latent states \(z_1, \dots, z_N\):

\[ P(x_1, \dots, x_N | a_1, \dots, a_{N - 1})=\sum_{z_1, \dots, z_N} P(x_1 | z_1) P(z_1)\prod_{n = 2}^N P(x_n | z_n) P(z_n | z_{n - 1}, a_{n - 1}) \]

- **Definitions of the transition tensor \(T\) and the emission matrix \(E\)**:

\[ T_{ijk}=P(Z_n = k | Z_{n - 1}=j, a_{n - 1}=i) \]

\[ E_{ij}=P(X_n = j | Z_n = i) \]

- **Conditional probability used in the fast rebinding algorithm**, that is, the distribution of the observation at position \(n\) given all other observations:

\[ p(X_n = j | x_{\setminus n})=p(X_n = j | x_1, \dots, x_{n - 1}, x_{n + 1}, \dots, x_N) \]

These formulas help to explain how CSCGs learn in context and adapt to new environments or inputs.
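
The formulas above can be made concrete with a small numerical sketch. The NumPy code below is an illustration under stated assumptions, not the paper's implementation: the function names (`clone_emission_matrix`, `sequence_likelihood`, `leave_one_out_distribution`) and the initial-state vector `pi` for \(P(z_1)\) are hypothetical, and the emission matrix is assumed to be deterministic, with each latent state (clone) emitting exactly one symbol. It evaluates the sequence probability with a forward recursion instead of the explicit sum over all latent paths, and computes \(p(X_n = j | x_{\setminus n})\) by combining forward and backward messages. It computes only the conditional that the rebinding algorithm uses; it does not perform rebinding itself.

```python
import numpy as np


def clone_emission_matrix(clones_per_symbol):
    """Build E[i, j] = P(X = j | Z = i), assuming each clone emits one symbol."""
    n_states = sum(clones_per_symbol)
    E = np.zeros((n_states, len(clones_per_symbol)))
    row = 0
    for symbol, count in enumerate(clones_per_symbol):
        E[row:row + count, symbol] = 1.0  # all clones of `symbol` emit it
        row += count
    return E


def sequence_likelihood(x, a, T, E, pi):
    """P(x_1..x_N | a_1..a_{N-1}) by forward recursion over latent states.

    T[i, j, k] = P(Z_n = k | Z_{n-1} = j, a_{n-1} = i)
    E[i, j]    = P(X_n = j | Z_n = i)
    pi[i]      = P(Z_1 = i)  (hypothetical name for the initial distribution)
    """
    alpha = np.asarray(pi, dtype=float) * E[:, x[0]]
    for n in range(1, len(x)):
        # Propagate through the transition for action a[n-1], then weight
        # each state by how well it explains observation x[n].
        alpha = (alpha @ T[a[n - 1]]) * E[:, x[n]]
    return float(alpha.sum())


def leave_one_out_distribution(x, a, T, E, pi, n):
    """Distribution of X_n given all other observations (0-based n).

    The value stored in x[n] is ignored. Assumes the remaining
    observations have non-zero probability under the model.
    """
    # Forward message: P(z_n, x_0..x_{n-1}), without the emission at n.
    alpha = np.asarray(pi, dtype=float)
    for m in range(n):
        alpha = (alpha * E[:, x[m]]) @ T[a[m]]
    # Backward message: P(x_{n+1}..x_{N-1} | z_n).
    beta = np.ones(E.shape[0])
    for m in range(len(x) - 1, n, -1):
        beta = T[a[m - 1]] @ (E[:, x[m]] * beta)
    posterior = alpha * beta
    posterior = posterior / posterior.sum()  # P(z_n | other observations)
    return posterior @ E  # marginalize the latent state through E
```

As a usage sketch, `clone_emission_matrix([2, 1])` gives a model with three latent states and two symbols, where the first symbol has two clones. With a transition tensor `T` of shape `(n_actions, 3, 3)` whose last axis sums to one, `sequence_likelihood(x, a, T, E, pi)` returns the probability of the observation list `x` given the action list `a`, and `leave_one_out_distribution(x, a, T, E, pi, n)` returns a vector over symbols for position `n`.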

Schema-learning and rebinding as mechanisms of in-context learning and emergence

In-Context Language Learning: Architectures and Algorithms

From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When

How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations

A Theory of Emergent In-Context Learning as Implicit Structure Induction

Explaining Emergent In-Context Learning as Kernel Regression

Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data

In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax

Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

Revisiting In-context Learning Inference Circuit in Large Language Models

A Data Generation Perspective to the Mechanism of In-Context Learning

ICLEval: Evaluating In-Context Learning Ability of Large Language Models

LLMs Are In-Context Reinforcement Learners

Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism

What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning

Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning

Why Larger Language Models Do In-context Learning Differently?

Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks

Decoding In-Context Learning: Neuroscience-inspired Analysis of Representations in Large Language Models

Competition Dynamics Shape Algorithmic Phases of In-Context Learning