Abstract: Large-scale neural language models exhibit a remarkable capacity for in-context learning (ICL): they can infer novel functions from datasets provided as input. Most of our current understanding of when and how ICL arises comes from LMs trained on extremely simple learning problems like linear regression and associative recall. There remains a significant gap between these model problems and the "real" ICL exhibited by LMs trained on large text corpora, which involves not just retrieval and function approximation but free-form generation of language and other structured outputs. In this paper, we study ICL through the lens of a new family of model problems we term in-context language learning (ICLL). In ICLL, LMs are presented with a set of strings from a formal language, and must generate additional strings from the same language. We focus on in-context learning of regular languages generated by random finite automata. We evaluate a diverse set of neural sequence models (including several RNNs, Transformers, and state-space model variants) on regular ICLL tasks, aiming to answer three questions: (1) Which model classes are empirically capable of ICLL? (2) What algorithmic solutions do successful models implement to perform ICLL? (3) What architectural changes can improve ICLL in less performant models? We first show that Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks. Next, we provide evidence that their ability to do so relies on specialized "n-gram heads" (higher-order variants of induction heads) that compute input-conditional next-token distributions. Finally, we show that hard-wiring these heads into neural models improves performance not just on ICLL, but natural language modeling -- improving the perplexity of 340M-parameter models by up to 1.14 points (6.7%) on the SlimPajama dataset.
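To make the task setup concrete, here is a minimal sketch of how a regular ICLL prompt could be built: sample a random finite automaton, then sample several strings from it and join them into one context. The abstract does not specify the sampling procedure, so the function names, the `edge_prob` parameter, the fixed string length, and the choice to treat every state as accepting are all assumptions made for illustration.

```python
import random

def sample_dfa(num_states, alphabet, rng, edge_prob=0.5):
    """Sample a random partial DFA as {state: {symbol: next_state}}.

    Assumption: each (state, symbol) pair gets a transition with
    probability edge_prob, and every state keeps at least one edge.
    """
    dfa = {}
    for state in range(num_states):
        symbols = [a for a in alphabet if rng.random() < edge_prob]
        if not symbols:
            symbols = [rng.choice(alphabet)]  # guarantee one outgoing edge
        dfa[state] = {a: rng.randrange(num_states) for a in symbols}
    return dfa

def sample_string(dfa, length, rng, start=0):
    """Random walk of a fixed length through the automaton."""
    state, out = start, []
    for _ in range(length):
        sym = rng.choice(sorted(dfa[state]))
        out.append(sym)
        state = dfa[state][sym]
    return "".join(out)

def in_language(dfa, string, start=0):
    """True if every symbol follows a defined transition (all states accepting)."""
    state = start
    for sym in string:
        if sym not in dfa[state]:
            return False
        state = dfa[state][sym]
    return True

def make_icll_prompt(dfa, num_examples, length, rng, sep=" "):
    """Join several sampled strings into a single in-context prompt."""
    return sep.join(sample_string(dfa, length, rng) for _ in range(num_examples))
```

A model that has learned the task would be expected to continue such a prompt with a string for which `in_language` returns `True`.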

What problem does this paper attempt to address?

The paper investigates in-context learning (ICL) in large-scale neural language models by proposing a new family of model problems called in-context language learning (ICLL). In ICLL, a model is given a set of strings from a formal language and must generate additional strings from the same language. The research aims to answer three questions: which model classes can perform ICLL, what algorithmic solutions successful models implement, and what architectural changes can improve less performant models. The study evaluates several types of neural sequence models on ICLL tasks built from regular languages generated by random finite automata. It finds that Transformers significantly outperform recurrent and convolutional models, and provides evidence that this advantage relies on specialized "n-gram heads" that compute input-conditional next-token distributions. Finally, hard-wiring these heads into neural models improves performance both on ICLL and on natural language modeling.
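As a rough illustration of what an "input-conditional next-token distribution" means, the sketch below builds one by counting n-grams inside the context itself. It is a plain counting baseline, not the attention mechanism the paper analyzes, and the function name and signature are hypothetical.

```python
from collections import Counter

def ngram_next_token_distribution(context, n=2):
    """Estimate P(next token | last n-1 tokens) from counts in the context.

    Finds every earlier occurrence of the final (n-1)-token suffix and
    counts which token followed it. Returns an empty dict if the suffix
    has not been seen earlier in the context.
    """
    k = n - 1
    if len(context) < k:
        return {}
    key = tuple(context[len(context) - k:])
    counts = Counter()
    for i in range(len(context) - k):
        if tuple(context[i:i + k]) == key:
            counts[context[i + k]] += 1
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}
```

With `n=2` this mirrors the behavior usually attributed to an induction head (look up what followed the current token earlier in the context); larger `n` conditions on a longer suffix, which is the sense in which n-gram heads are described as higher-order variants.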

In-Context Language Learning: Architectures and Algorithms

From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When

LLMs Are In-Context Reinforcement Learners

What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

Why Larger Language Models Do In-context Learning Differently?

Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning

Decoding In-Context Learning: Neuroscience-inspired Analysis of Representations in Large Language Models

In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax

What Do Language Models Learn in Context? The Structured Task Hypothesis

Schema-learning and rebinding as mechanisms of in-context learning and emergence

A Survey on In-context Learning

In-Context Learning Learns Label Relationships but Is Not Conventional Learning

Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability

A Data Generation Perspective to the Mechanism of In-Context Learning

Competition Dynamics Shape Algorithmic Phases of In-Context Learning

What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning

"In-Context Learning" or: How I learned to stop worrying and love "Applied Information Retrieval"

Revisiting In-context Learning Inference Circuit in Large Language Models

Do pretrained Transformers Learn In-Context by Gradient Descent?

ICLEval: Evaluating In-Context Learning Ability of Large Language Models

Can Custom Models Learn In-Context? An Exploration of Hybrid Architecture Performance on In-Context Learning Tasks