Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers

MohammadReza Ebrahimi, Sunny Panchal, Roland Memisevic
2024-08-10
Abstract: Despite their recent successes, Transformer-based large language models show surprising failure modes. A well-known example of such failure modes is their inability to length-generalize: solving problem instances at inference time that are longer than those seen during training. In this work, we further explore the root cause of this failure by performing a detailed analysis of model behaviors on the simple parity task. Our analysis suggests that length generalization failures are intricately related to a model's inability to perform random memory accesses within its context window. We present supporting evidence for this hypothesis by demonstrating the effectiveness of methodologies that circumvent the need for indexing or that enable random token access indirectly, through content-based addressing. We further show where and how the failure to perform random memory access manifests through attention map visualizations.
Computation and Language
What problem does this paper attempt to address?
The paper examines a core weakness of Transformer models on algorithmic tasks: length generalization. Although Transformers perform well on natural language processing tasks, they struggle with problem instances longer than those seen during training, especially in tasks that require precise positional information. The authors analyze a simple algorithmic task, binary parity, in detail and find that Transformer models have difficulty performing random memory access, that is, retrieving information by exact position. This is because Transformers rely mainly on content-based attention when handling natural language, and that mechanism performs poorly in tasks requiring index-based memory access.

To validate this hypothesis, the researchers propose two methods that help the model work around the limitation:

1. **Interleaved Scratchpad**: The input sequence is formatted so that the currently active bit always appears in the last position of the context window, which simplifies the random access operations the model needs to perform.
2. **Mnemonics**: Matching "anchor" markers are added to the standard scratchpad format, so the model can achieve index-based access indirectly through content-based attention. The mnemonics let the model look back to the relevant earlier tokens, which addresses the random access problem.

The authors also analyze attention patterns and experiment with different mnemonic variants to further support their hypothesis. Finally, they extend the study to a second algorithmic task, multi-digit addition, and show that mnemonics help the model learn the correct algorithm and achieve length generalization.

In summary, the paper asks why Transformer models struggle with length generalization in algorithmic tasks that require precise positional information. Its experiments indicate that the lack of effective index-based memory access is the fundamental reason, and it proposes ways to work around that limitation.
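To make the difference between the scratchpad formats concrete, here is a minimal Python sketch of how a parity instance might be serialized in each style. The exact token layout, the `|` separator, and the use of letters as anchor tokens are assumptions made for illustration, and the function names are hypothetical; the paper's actual formats may differ.

```python
from itertools import accumulate
from operator import xor


def running_parity(bits):
    """Parity of every prefix of `bits`; these are the scratchpad values."""
    return list(accumulate(bits, xor))


def standard_scratchpad(bits):
    """Input bits first, then the running parities (assumed layout).

    To produce step i, the model must look back to input position i,
    which is an index-based (random) access.
    """
    inputs = " ".join(str(b) for b in bits)
    pad = " ".join(str(p) for p in running_parity(bits))
    return f"{inputs} | {pad}"


def interleaved_scratchpad(bits):
    """Each input bit is followed directly by the running parity (assumed layout).

    The active bit is always the most recent input token, so no lookup
    by position is needed.
    """
    tokens = []
    for bit, parity in zip(bits, running_parity(bits)):
        tokens += [str(bit), str(parity)]
    return " ".join(tokens)


def mnemonic_scratchpad(bits, anchors="abcdefghijklmnopqrstuvwxyz"):
    """Tag input bit i and scratchpad step i with the same anchor token.

    The anchor alphabet is hypothetical. Matching anchors let the lookup
    be done by content (find the same anchor) instead of by position.
    """
    if len(bits) > len(anchors):
        raise ValueError("not enough anchor tokens for this sequence")
    inputs = " ".join(f"{a} {b}" for a, b in zip(anchors, bits))
    pad = " ".join(f"{a} {p}" for a, p in zip(anchors, running_parity(bits)))
    return f"{inputs} | {pad}"


if __name__ == "__main__":
    example = [1, 0, 1, 1]
    print(standard_scratchpad(example))
    print(interleaved_scratchpad(example))
    print(mnemonic_scratchpad(example))
```

In the standard layout, the distance between a scratchpad step and the input bit it depends on grows with sequence length. The interleaved layout keeps that distance fixed, and the mnemonic layout replaces positional lookup with matching on the anchor token.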