Abstract: While large language models based on the transformer architecture have demonstrated remarkable in-context learning (ICL) capabilities, understanding of such capabilities is still at an early stage, and existing theory and mechanistic understanding focus mostly on simple scenarios such as learning simple function classes. This paper takes initial steps toward understanding ICL in more complex scenarios by studying learning with representations. Concretely, we construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function, composed with a linear function that differs in each instance. By construction, the optimal ICL algorithm first transforms the inputs by the representation function, and then performs linear ICL on top of the transformed dataset. We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size. Empirically, we find trained transformers consistently achieve near-optimal ICL performance in this setting, and exhibit the desired dissection where lower layers transform the dataset and upper layers perform linear ICL. Through extensive probing and a new pasting experiment, we further reveal several mechanisms within the trained transformers, such as concrete copying behaviors on both the inputs and the representations, linear ICL capability of the upper layers alone, and a post-ICL representation selection mechanism in a harder mixture setting. These observed mechanisms align well with our theory and may shed light on how transformers perform ICL in more realistic scenarios.
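
The compositional setup described in the abstract can be written out as a short formula sketch. The notation here ($\Phi^\star$, $w$, $\sigma$, $\tau$, $\lambda$) is chosen for illustration, and the Gaussian prior and noise are assumptions; the paper's own notation and normalization may differ.

```latex
% Each in-context instance: fixed representation \Phi^\star, instance-specific w
y_i = \langle w, \Phi^\star(x_i) \rangle + \sigma z_i, \qquad i = 1, \dots, N,
\quad w \sim \mathcal{N}(0, \tau^2 I_D), \quad z_i \sim \mathcal{N}(0, 1).

% Ridge regression on the transformed inputs, then prediction at the query
\hat{w} = \arg\min_{w} \sum_{i=1}^{N}
  \bigl( \langle w, \Phi^\star(x_i) \rangle - y_i \bigr)^2
  + \lambda \lVert w \rVert_2^2,
\qquad
\hat{y}_{N+1} = \langle \hat{w}, \Phi^\star(x_{N+1}) \rangle .
```

Under these assumptions, choosing $\lambda = \sigma^2 / \tau^2$ makes $\hat{w}$ the posterior mean of $w$, which is why ridge regression on the transformed inputs is the natural reference algorithm.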

What problem does this paper attempt to address?

This paper seeks to understand the in-context learning (ICL) ability of Transformer-based large language models in more complex scenarios. Most existing analyses of ICL focus on learning simple function classes, such as linear functions, shallow neural networks, and decision trees. Real-world tasks are often more complex and involve learning with representations.

### Core Problems of the Paper

By studying synthetic in-context learning problems with a fixed representation function, the paper explores the following points:

1. **Theoretical Construction**: Are there Transformers that approximately implement the optimal ICL algorithm? This algorithm first transforms the inputs with the representation function and then performs linear ICL on the transformed dataset.
2. **Empirical Analysis**: Can a trained Transformer achieve near-optimal ICL performance in this setting and exhibit the expected division of labor, in which the lower layers transform the data and the upper layers perform linear ICL?
3. **Mechanism Revelation**: Which mechanisms inside the trained Transformer do extensive probing and a new pasting experiment reveal? Examples include concrete copying behaviors, the linear ICL ability of the upper layers alone, and a post-ICL representation selection mechanism in a harder mixture setting.

### Research Background and Motivation

Existing ICL research mainly focuses on simple scenarios, such as learning simple function classes. These studies provide a basis for understanding ICL, but they may not reflect the complexity of real-world tasks. For example, learning linear functions of the raw input is well understood theoretically, but in practice the learner can often draw on prior knowledge, modeled here as a fixed representation function, which makes the problem more complex.
### Main Contributions

1. **Theoretical Contribution**: The paper constructs Transformers that perform in-context ridge regression on top of the representation. These constructions need only mild depth and size, and they make predictions at every token, not just the last one.
2. **Empirical Contribution**: Experiments show that trained small Transformers achieve near-optimal ICL performance in this setting and exhibit mechanisms consistent with the theory.
3. **Mechanism Discovery**: Linear probing and a new pasting experiment uncover several lower-level behaviors in the trained Transformers, such as concrete copying of both the inputs and the representations, the linear ICL ability of the upper layers alone, and a post-ICL representation selection mechanism in a harder mixture setting.

### Summary

This paper extends the understanding of Transformers' in-context learning ability by introducing learning problems with representations. It provides theoretical constructions, supports them with experiments, and reveals several underlying mechanisms, laying a foundation for further research on how Transformers perform ICL in more complex tasks.
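
The two-step algorithm summarized above (transform the inputs, then run linear ICL on the result) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the one-layer tanh representation, the dimensions, the noise level, and the helper names `make_representation`, `ridge_fit`, and `compare_predictors` are hypothetical and are not taken from the paper. The sketch compares ridge regression on the transformed inputs with ridge regression on the raw inputs for a single in-context instance.

```python
import numpy as np


def make_representation(d, D, rng):
    """Fixed representation Phi: R^d -> R^D (hypothetical one-layer tanh map)."""
    B = rng.standard_normal((D, d))  # drawn once, shared by every instance
    return lambda X: np.tanh(X @ B.T)


def ridge_fit(F, y, lam):
    """Closed-form ridge solution (F^T F + lam * I)^{-1} F^T y."""
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ y)


def compare_predictors(seed=0, d=4, D=8, n=64, n_query=500, sigma=0.1, tau=1.0):
    """Return (mse_with_representation, mse_on_raw_inputs) for one instance."""
    rng = np.random.default_rng(seed)
    phi = make_representation(d, D, rng)

    # One in-context instance: a fresh linear head w on top of the fixed phi.
    w = tau * rng.standard_normal(D)
    X = rng.standard_normal((n, d))
    y = phi(X) @ w + sigma * rng.standard_normal(n)

    # Held-out queries, scored against the noiseless target.
    Xq = rng.standard_normal((n_query, d))
    target = phi(Xq) @ w

    lam = sigma**2 / tau**2  # posterior-mean choice for Gaussian prior and noise
    w_rep = ridge_fit(phi(X), y, lam)  # step 1: transform, step 2: linear fit
    w_raw = ridge_fit(X, y, lam)       # baseline that skips the representation

    mse_rep = float(np.mean((phi(Xq) @ w_rep - target) ** 2))
    mse_raw = float(np.mean((Xq @ w_raw - target) ** 2))
    return mse_rep, mse_raw


if __name__ == "__main__":
    mse_rep, mse_raw = compare_predictors()
    print(f"ridge on Phi(x): {mse_rep:.4f}   ridge on raw x: {mse_raw:.4f}")
```

In this toy setting the predictor that applies the representation first should reach a much lower error than the raw-input baseline, since the labels are linear in the transformed inputs but not in the raw ones. The sketch only shows the reference algorithm; it says nothing about how a trained Transformer implements it internally.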

How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations

Towards Understanding How Transformers Learn In-context Through a Representation Learning Lens

Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?

Transformers are Deep Optimizers: Provable In-Context Learning for Deep Model Training

In-Context Learning with Representations: Contextual Generalization of Trained Transformers

Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data

In-context Learning on Function Classes Unveiled for Transformers

Transformers learn variable-order Markov chains in-context

Can Transformers Learn Sequential Function Classes In Context?

Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection

Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Trained Transformers Learn Linear Models In-Context

Breaking through the learning plateaus of in-context learning in Transformer

Asymptotic theory of in-context learning by linear attention

Provable In-Context Learning of Linear Systems and Linear Elliptic PDEs with Transformers

Does learning the right latent variables necessarily improve in-context learning?

Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions

On the Training Convergence of Transformers for In-Context Classification

Transformers are Minimax Optimal Nonparametric In-Context Learners