Abstract: Despite recent progress in long-context language models, it remains unclear how transformer-based models retrieve relevant information from arbitrary locations within a long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention head, which we dub the retrieval head, is largely responsible for retrieving information. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context, and when the context length is extended by continual pretraining, it is still the same set of heads that performs information retrieval; (4) dynamically activated: taking Llama-2 7B as an example, 12 retrieval heads always attend to the required information no matter how the context is changed, while the rest of the retrieval heads are activated in different contexts; (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context. Conversely, tasks where the model directly generates the answer using its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
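The abstract does not spell out how heads are scored or pruned, so the following is only a minimal sketch under stated assumptions. It assumes a copy-based score (the fraction of copy steps at which a head's most-attended context position is the position the generated token was copied from), an illustrative threshold of 0.1, and pruning modeled as zeroing a head's output vector. The names `retrieval_score`, `select_retrieval_heads`, and `mask_heads` are hypothetical and do not come from the paper or from any library.

```python
from typing import Dict, List, Tuple

# (layer index, head index); a hypothetical identifier used only in this sketch.
Head = Tuple[int, int]


def retrieval_score(top_attended: List[int], copied_from: List[int]) -> float:
    """Fraction of copy steps where the head's most-attended context
    position equals the position the generated token was copied from.

    copied_from[t] is -1 when decoding step t did not copy a context token.
    This scoring rule is an assumption, not taken from the abstract.
    """
    copy_steps = [t for t, src in enumerate(copied_from) if src >= 0]
    if not copy_steps:
        return 0.0
    hits = sum(1 for t in copy_steps if top_attended[t] == copied_from[t])
    return hits / len(copy_steps)


def select_retrieval_heads(
    scores: Dict[Head, float], threshold: float = 0.1
) -> List[Head]:
    """Heads whose score reaches the (illustrative) threshold, best first."""
    picked = [head for head, score in scores.items() if score >= threshold]
    return sorted(picked, key=lambda head: scores[head], reverse=True)


def mask_heads(
    head_outputs: Dict[Head, List[float]], pruned: List[Head]
) -> Dict[Head, List[float]]:
    """Zero the output vector of every pruned head; copy the rest unchanged."""
    blocked = set(pruned)
    return {
        head: ([0.0] * len(vec) if head in blocked else list(vec))
        for head, vec in head_outputs.items()
    }
```

In a causal comparison like the one the abstract describes, `mask_heads` would be applied once to the selected retrieval heads and once to an equally sized random set of non-retrieval heads, and the retrieval accuracy of the two masked models would then be compared.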
