Abstract: Despite their recent successes, Transformer-based large language models show surprising failure modes. A well-known example of such failure modes is their inability to length-generalize: solving problem instances at inference time that are longer than those seen during training. In this work, we further explore the root cause of this failure by performing a detailed analysis of model behaviors on the simple parity task. Our analysis suggests that length generalization failures are intricately related to a model's inability to perform random memory accesses within its context window. We present supporting evidence for this hypothesis by demonstrating the effectiveness of methodologies that circumvent the need for indexing or that enable random token access indirectly, through content-based addressing. We further show where and how the failure to perform random memory access manifests through attention map visualizations.

What problem does this paper attempt to address?

The paper explores a core problem that Transformer models face on algorithmic tasks: length generalization. Although Transformers perform well on natural language processing tasks, they struggle with problem instances longer than those seen during training, especially on tasks that require precise positional information. The authors analyze a simple algorithmic task, binary parity, in detail and find that Transformer models have difficulty performing random memory access, i.e., retrieving information by its exact position. The explanation offered is that Transformers rely mainly on content-based attention, which suits natural language tasks but performs poorly when a task requires index-based memory access.

To test this hypothesis, the researchers propose two methods that help the model work around the limitation:

1. **Interleaved Scratchpad**: The input sequence is formatted so that the currently active bit always appears in the last position of the context window, which simplifies the random access operations the model needs to perform.
2. **Mnemonics**: Matching "anchor" markers are added to the standard scratchpad format, allowing the model to achieve index-based access indirectly through content-based attention. The mnemonics let the model look back to the relevant earlier information, which addresses the random access problem.

The authors also analyze attention patterns and experiment with different variants of mnemonics to further support their hypothesis. Finally, they extend the study to a second algorithmic task, multi-digit addition, and show that mnemonics help the model learn the correct algorithm and achieve length generalization there as well.

In summary, the paper asks: why do Transformer models struggle with length generalization on algorithmic tasks that require precise positional information? The experiments support the view that the lack of effective index-based memory access is a root cause of the failure, and the paper proposes ways to work around it.
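To make the difference between the formats concrete, here is a minimal Python sketch that builds three token sequences for the parity task: a standard scratchpad, an interleaved scratchpad, and a scratchpad with mnemonics. The exact token layouts, the `=` separator, the letter anchors, and all function names are assumptions made for illustration; the paper's actual formats may differ.

```python
from itertools import accumulate
from operator import xor


def parity_scratchpad(bits):
    """Standard scratchpad (assumed layout): all input bits, then running parities.

    To produce the i-th scratchpad token, a model must locate the i-th input
    bit by position, which is the index-based access the summary describes.
    """
    running = list(accumulate(bits, xor))
    return [str(b) for b in bits] + ["="] + [str(p) for p in running]


def interleaved_scratchpad(bits):
    """Interleaved scratchpad (assumed layout): each bit is followed by the parity so far.

    When a parity token has to be produced, the active input bit is the last
    token in the context, so no lookup by position is needed.
    """
    tokens = []
    parity = 0
    for b in bits:
        parity ^= b
        tokens += [str(b), str(parity)]
    return tokens


def mnemonic_scratchpad(bits, anchors="abcdefghij"):
    """Scratchpad with mnemonics (hypothetical anchors): matching markers on both sides.

    The same anchor token precedes an input bit and its scratchpad step, so
    the bit can be found by matching content instead of counting positions.
    """
    if len(bits) > len(anchors):
        raise ValueError("not enough anchor symbols for this input")
    running = list(accumulate(bits, xor))
    prompt = [t for a, b in zip(anchors, bits) for t in (a, str(b))]
    steps = [t for a, p in zip(anchors, running) for t in (a, str(p))]
    return prompt + ["="] + steps


if __name__ == "__main__":
    bits = [1, 0, 1, 1]
    print(" ".join(parity_scratchpad(bits)))
    print(" ".join(interleaved_scratchpad(bits)))
    print(" ".join(mnemonic_scratchpad(bits)))
```

In all three variants the final parity token is the answer. What changes is how the model has to find the next bit to process: by position in the first format, at the end of the context in the second, and by matching an anchor in the third.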

Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers
