Schema-learning and rebinding as mechanisms of in-context learning and emergence

Sivaramakrishnan Swaminathan, Antoine Dedieu, Rajkumar Vasudeva Raju, Murray Shanahan, Miguel Lazaro-Gredilla, Dileep George
2023-06-16
Abstract: In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs). Yet the mechanisms that underlie it are poorly understood. In this paper, we demonstrate that comparable ICL capabilities can be acquired by an alternative sequence prediction learning method using clone-structured causal graphs (CSCGs). Moreover, a key property of CSCGs is that, unlike transformer-based LLMs, they are *interpretable*, which considerably simplifies the task of explaining how ICL works. Specifically, we show that it uses a combination of (a) learning template (schema) circuits for pattern completion, (b) retrieving relevant templates in a context-sensitive manner, and (c) rebinding of novel tokens to appropriate slots in the templates. We go on to marshall evidence for the hypothesis that similar mechanisms underlie ICL in LLMs. For example, we find that, with CSCGs as with LLMs, different capabilities emerge at different levels of overparameterization, suggesting that overparameterization helps in learning more complex template (schema) circuits. By showing how ICL can be achieved with small models and datasets, we open up a path to novel architectures, and take a vital step towards a more general understanding of the mechanics behind this important capability.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to explain the mechanism of in-context learning (ICL) in large language models (LLMs). ICL is the ability of a pre-trained language model to pick up a new task from a small number of examples at inference time, even though the model was trained only to predict the next token. This ability lets LLMs handle a much wider range of applications, but the mechanism behind it is not fully understood.

The paper investigates this mechanism using an alternative sequence prediction method, clone-structured causal graphs (CSCGs). Unlike transformer-based LLMs, CSCGs are interpretable, which makes it easier to explain how ICL works. The authors show that CSCGs achieve ICL through a combination of three mechanisms:

1. **Learning schema circuits**: template (schema) circuits are learned and used for pattern completion.
2. **Retrieving relevant schemas**: the appropriate schema is selected in a context-sensitive manner.
3. **Rebinding new tokens**: novel tokens are bound to the appropriate slots in the schema.

The paper also presents evidence for the hypothesis that similar mechanisms underlie ICL in transformer-based LLMs. For example, in CSCGs as in LLMs, different capabilities emerge at different levels of overparameterization, which suggests that overparameterization helps in learning more complex schema circuits. By showing that ICL can be achieved with small models and datasets, the paper opens a path to new architectures and takes a step towards a more general understanding of the mechanism behind this capability.

### Formula Summary

The formulas in the paper mainly describe the probabilistic model and the update algorithm of CSCGs.
Here are several key formulas:

- **Probability distribution of the observed sequence**:

  \[ P(x_1, \dots, x_N \mid a_1, \dots, a_{N-1}) = \sum_{z_1, \dots, z_N} P(x_1 \mid z_1) \, P(z_1) \prod_{n=2}^{N} P(x_n \mid z_n) \, P(z_n \mid z_{n-1}, a_{n-1}) \]

- **Definitions of the transition tensor \(T\) and the emission matrix \(E\)**:

  \[ T_{ijk} = P(Z_n = k \mid Z_{n-1} = j, a_{n-1} = i) \]

  \[ E_{ij} = P(X_n = j \mid Z_n = i) \]

- **Conditional probability used in the fast rebinding algorithm**:

  \[ p(X_n = j \mid x_{\setminus n}) = p(X_n = j \mid x_1, \dots, x_{n-1}, x_{n+1}, \dots, x_N) \]

These formulas help to explain how CSCGs learn in context and adapt to new environments or inputs.
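
To make the formulas above concrete, here is a minimal sketch in plain Python that evaluates the sequence probability with the standard forward recursion, checks it against the explicit sum over hidden states, and computes the leave-one-out conditional \(p(X_n = j \mid x_{\setminus n})\) directly from its definition. The function names, the toy values of `T`, `E`, and `prior`, and the choice of three clones and two symbols are assumptions made for illustration; they do not come from the paper. The last function is a naive ratio of likelihoods, not the paper's fast rebinding algorithm.

```python
from itertools import product

def sequence_likelihood(xs, actions, T, E, prior):
    """P(x_1..x_N | a_1..a_{N-1}) via the forward recursion.

    T[i][j][k] = P(Z_n=k | Z_{n-1}=j, a_{n-1}=i)   (transition tensor)
    E[i][j]    = P(X_n=j | Z_n=i)                  (emission matrix)
    prior[k]   = P(Z_1=k)
    """
    assert len(actions) == len(xs) - 1
    n_states = len(prior)
    # alpha[k] = P(x_1..x_n, Z_n=k | a_1..a_{n-1})
    alpha = [prior[k] * E[k][xs[0]] for k in range(n_states)]
    for x, a in zip(xs[1:], actions):
        alpha = [
            E[k][x] * sum(alpha[j] * T[a][j][k] for j in range(n_states))
            for k in range(n_states)
        ]
    return sum(alpha)

def brute_force_likelihood(xs, actions, T, E, prior):
    """Same quantity, summing explicitly over every hidden path z_1..z_N."""
    total = 0.0
    for zs in product(range(len(prior)), repeat=len(xs)):
        p = prior[zs[0]] * E[zs[0]][xs[0]]
        for n in range(1, len(xs)):
            p *= E[zs[n]][xs[n]] * T[actions[n - 1]][zs[n - 1]][zs[n]]
        total += p
    return total

def leave_one_out_posterior(xs, actions, n, T, E, prior):
    """p(X_n=j | all other observations) for every symbol j (n is 0-indexed).

    Naive version: substitute each candidate symbol at position n and
    normalise the resulting sequence likelihoods.
    """
    n_symbols = len(E[0])
    joint = [
        sequence_likelihood(xs[:n] + [j] + xs[n + 1:], actions, T, E, prior)
        for j in range(n_symbols)
    ]
    total = sum(joint)
    return [p / total for p in joint]

# Toy model (made-up numbers): 2 actions, 3 hidden states, 2 symbols.
# States 0 and 1 are "clones" that both emit symbol 0; state 2 emits symbol 1.
T = [
    [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3], [0.4, 0.4, 0.2]],  # action 0
    [[0.7, 0.1, 0.2], [0.2, 0.2, 0.6], [0.3, 0.3, 0.4]],  # action 1
]
E = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
prior = [0.5, 0.3, 0.2]

if __name__ == "__main__":
    xs, actions = [0, 1, 0], [0, 1]
    print(sequence_likelihood(xs, actions, T, E, prior))
    print(leave_one_out_posterior(xs, actions, 1, T, E, prior))
```

The forward recursion costs time linear in the sequence length, whereas the explicit sum grows exponentially with it; the brute-force version is included only as a correctness check on short sequences.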