Abstract: Large language models have demonstrated impressive in-context learning (ICL) capability. However, it is still unclear how the underlying transformers accomplish this, especially in more complex scenarios. Toward this goal, several recent works have studied how transformers learn fixed-order Markov chains (FOMC) in context, yet natural languages are more suitably modeled by variable-order Markov chains (VOMC), i.e., context trees (CTs). In this work, we study the ICL of VOMCs by viewing language modeling as a form of data compression, focusing on small alphabets and low-order VOMCs. This perspective allows us to use mature compression algorithms, such as the context-tree weighting (CTW) and prediction by partial matching (PPM) algorithms, as baselines; the former is Bayesian optimal for a class of CTW priors. We empirically observe a few phenomena: 1) Transformers can indeed learn to compress VOMCs in context, while PPM suffers significantly; 2) The performance of transformers is not very sensitive to the number of layers, and even a two-layer transformer can learn in context quite well; and 3) Transformers trained and tested on non-CTW priors can significantly outperform the CTW algorithm. To explain these phenomena, we analyze the attention maps of the transformers and extract two mechanisms, based on which we provide two transformer constructions: 1) a construction with $D+2$ layers that can accurately mimic the CTW algorithm for CTs of maximum order $D$, and 2) a 2-layer transformer that uses the feed-forward network for probability blending. One distinction from the FOMC setting is that a counting mechanism appears to play an important role. We implement these synthetic transformer layers and show that such hybrid transformers can match the ICL performance of transformers and, more interestingly, that some of them can perform even better despite their much-reduced parameter sets.
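To make the CTW baseline and the role of counting more concrete, the following is a minimal sketch of a textbook-style CTW predictor, not the paper's transformer construction. It rests on several assumptions that the abstract does not specify: a Krichevsky-Trofimov (KT) estimator at each node, fixed 1/2-1/2 mixing weights (one particular CTW prior), and conditioning on the first `max_order` symbols. All function names are hypothetical.

```python
import math
from collections import defaultdict


def context_counts(seq, max_order, alphabet_size):
    """counts[ctx][a] = how often symbol a followed the suffix ctx.

    Only positions with a full length-max_order history are counted,
    so each node's counts equal the sum of its children's counts.
    """
    counts = defaultdict(lambda: [0] * alphabet_size)
    for i in range(max_order, len(seq)):
        for d in range(max_order + 1):
            counts[tuple(seq[i - d:i])][seq[i]] += 1
    return counts


def log_kt(n, alphabet_size):
    """Log KT (Dirichlet-1/2) probability of the symbols summarized by counts n."""
    total = sum(n)
    out = math.lgamma(alphabet_size / 2) - math.lgamma(total + alphabet_size / 2)
    for c in n:
        out += math.lgamma(c + 0.5) - math.lgamma(0.5)
    return out


def log_ctw(counts, ctx, max_order, alphabet_size):
    """Log weighted probability at node ctx (assumed 1/2-1/2 weights)."""
    if ctx not in counts:  # unseen context contributes probability 1
        return 0.0
    log_pe = log_kt(counts[ctx], alphabet_size)
    if len(ctx) == max_order:  # maximum depth: no further splitting
        return log_pe
    # Children extend the context one symbol further into the past.
    log_children = sum(
        log_ctw(counts, (a,) + ctx, max_order, alphabet_size)
        for a in range(alphabet_size)
    )
    m = max(log_pe, log_children)
    return m + math.log(
        0.5 * math.exp(log_pe - m) + 0.5 * math.exp(log_children - m)
    )


def ctw_predict(seq, max_order, alphabet_size):
    """Next-symbol distribution as a ratio of root weighted probabilities."""
    base = log_ctw(context_counts(seq, max_order, alphabet_size), (),
                   max_order, alphabet_size)
    probs = []
    for a in range(alphabet_size):
        ext = list(seq) + [a]
        top = log_ctw(context_counts(ext, max_order, alphabet_size), (),
                      max_order, alphabet_size)
        probs.append(math.exp(top - base))
    return probs


if __name__ == "__main__":
    # A strictly alternating binary sequence ending in 1.
    print(ctw_predict([0, 1] * 20, max_order=2, alphabet_size=2))
```

On the alternating sequence, the predictor should place most of its mass on symbol 0, because the order-1 and order-2 suffix counts are deterministic while the order-0 counts are balanced. This sketch recomputes all counts for each candidate symbol for clarity; a practical implementation would update them incrementally.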

Transformers learn variable-order Markov chains in-context