Abstract: This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task-specific information distributed over a long context. We demonstrate that ARMT outperforms existing alternatives in associative retrieval tasks and sets a new performance record in the recent BABILong multi-task long-context benchmark by answering single-fact questions over 50 million tokens with an accuracy of 79.9%. The source code for training and evaluation is available on GitHub.

What problem does this paper attempt to address?

This paper proposes a new neural architecture, the Associative Recurrent Memory Transformer (ARMT), designed to process very long sequences while handling new information in constant time at each time step. ARMT uses transformer self-attention for local context and segment-level recurrence to store task-specific information that is distributed over a long context. It extends the Recurrent Memory Transformer (RMT) with an associative memory mechanism, uses global-local self-attention, and keeps time and space complexity similar to that of RMT.

Compared with existing methods such as RMT and Mamba, ARMT performs well on associative retrieval and long-context tasks. On the BABILong multi-task long-context benchmark it answers single-fact questions over 50 million tokens with an accuracy of 79.9%, which indicates strong length generalization. The paper also reports that ARMT remains robust under a large number of memory update operations and outperforms retrieval-only methods on complex tasks that require reasoning across multiple pieces of information.

The main contributions of the paper are: 1) a novel ARMT architecture for long-context processing; 2) evidence that ARMT outperforms existing memory-based models on associative retrieval and long-context tasks; 3) an original method for evaluating a model's memory capacity in associative retrieval tasks. In summary, ARMT offers an efficient way to process long sequences by strengthening the model's memory and reasoning capabilities, improving both efficiency and accuracy on large-scale inputs.
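The combination of segment-level recurrence and a fixed-size associative memory can be illustrated with a toy sketch. This is not the ARMT implementation: the class and function names, the one-hot keys and values, and the delta-rule write are all assumptions chosen for illustration, and the per-segment step is a stand-in for real self-attention. It only shows why the work per segment depends on the segment length and the memory size rather than on how much input has already been processed.

```python
# Toy sketch (hypothetical names, not the ARMT code): a stream of facts is
# processed segment by segment while a fixed-size associative memory is
# carried from one segment to the next.

def one_hot(index, dim):
    """Return a one-hot vector; used here as an idealized key or value."""
    return [1.0 if i == index else 0.0 for i in range(dim)]


def split_into_segments(items, segment_len):
    """Cut a long sequence into consecutive fixed-length segments."""
    return [items[i:i + segment_len] for i in range(0, len(items), segment_len)]


class AssociativeMemory:
    """Linear key-value memory stored in a dim x dim matrix W.

    Assumption: writes use a delta rule, so writing a new value under an
    existing key replaces the old association instead of adding to it.
    """

    def __init__(self, dim):
        self.dim = dim
        self.W = [[0.0] * dim for _ in range(dim)]

    def read(self, key):
        # value = W @ key
        return [sum(self.W[i][j] * key[j] for j in range(self.dim))
                for i in range(self.dim)]

    def write(self, key, value):
        # W += (value - W @ key) key^T / (key . key)
        old = self.read(key)
        norm = sum(k * k for k in key)
        for i in range(self.dim):
            for j in range(self.dim):
                self.W[i][j] += (value[i] - old[i]) * key[j] / norm


def process_stream(facts, segment_len, dim):
    """Process (key_id, value_id) facts one segment at a time.

    The memory size never grows, so the cost of each segment is bounded by
    segment_len * dim * dim regardless of the total stream length.
    """
    memory = AssociativeMemory(dim)
    for segment in split_into_segments(facts, segment_len):
        # Stand-in for local processing of the segment: every fact in the
        # segment is written into the shared memory.
        for key_id, value_id in segment:
            memory.write(one_hot(key_id, dim), one_hot(value_id, dim))
    return memory


# Entity 0 is first associated with value 3 and later updated to value 1.
facts = [(0, 3), (1, 2), (0, 1)]
memory = process_stream(facts, segment_len=2, dim=4)
latest_for_entity_0 = memory.read(one_hot(0, 4))
latest_for_entity_1 = memory.read(one_hot(1, 4))
```

In this sketch the later fact about entity 0 overwrites the earlier one, while the association for entity 1 is left untouched. Real keys are not orthogonal one-hot vectors, so retrieval in a trained model would be approximate rather than exact.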

Associative Recurrent Memory Transformer

HMT: Hierarchical Memory Transformer for Long Context Language Processing

In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

Blockwise Parallel Transformer for Large Context Models

R-Transformer: Recurrent Neural Network Enhanced Transformer

Long-Term Memory Networks for Question Answering

Adaptive Multi-Resolution Attention with Linear Complexity

RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition

Scaling Transformer to 1M tokens and beyond with RMT

Recurrent Action Transformer with Memory

Associative Transformer

TRAMS: Training-free Memory Selection for Long-range Language Modeling

Transformer-XL: Language modeling with longer-term dependency

Relational recurrent neural networks

Memory Consolidation Enables Long-Context Video Understanding

Memorization Capacity of Multi-Head Attention in Transformers

Attention as an RNN

Ring Attention with Blockwise Transformers for Near-Infinite Context

Novel Architecture for Long Short-Term Memory Used in Question Classification

CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism