Associative Recurrent Memory Transformer

Ivan Rodkin, Yuri Kuratov, Aydar Bulatov, Mikhail Burtsev
2024-07-06
Abstract: This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task-specific information distributed over a long context. We demonstrate that ARMT outperforms existing alternatives in associative retrieval tasks and sets a new performance record in the recent BABILong multi-task long-context benchmark by answering single-fact questions over 50 million tokens with an accuracy of 79.9%. The source code for training and evaluation is available on GitHub.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper proposes a neural architecture, the Associative Recurrent Memory Transformer (ARMT), designed to process very long sequences while spending constant time on new information at each time step. ARMT uses transformer self-attention for local context and segment-level recurrence to store task-specific information distributed over a long context. It extends the Recurrent Memory Transformer (RMT) with an associative memory mechanism, uses global-local self-attention, and keeps time and space complexity similar to RMT's.
Compared with existing methods such as RMT and Mamba, ARMT performs well on associative retrieval and long-context tasks. On the BABILong multi-task long-context benchmark it answers single-fact questions over 50 million tokens with 79.9% accuracy, which demonstrates strong length generalization. The paper also shows that ARMT remains robust under a large number of memory update operations and outperforms retrieval-only methods on complex tasks that require reasoning across multiple pieces of information.
The main contributions are: 1) a novel ARMT architecture for long-context processing; 2) evidence that ARMT outperforms existing memory-based models on associative retrieval and long-context tasks; 3) an original method for evaluating a model's memory capacity on associative retrieval tasks. In summary, ARMT offers an efficient way to process long sequences by strengthening the model's memory and reasoning, improving both efficiency and accuracy on large-scale inputs.
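To make the idea of segment-level recurrence with an associative memory more concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the class `AssociativeMemory`, the function `process_segments`, and the delta-rule update used here are illustrative assumptions standing in for whatever update rule ARMT actually uses, and the transformer layers that would produce keys and values are omitted entirely. The sketch only shows the property the summary emphasizes: the memory has a fixed size, so the cost of handling each new segment does not grow with the number of segments already seen.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


class AssociativeMemory:
    """Fixed-size linear key-value store (hypothetical, for illustration only)."""

    def __init__(self, key_dim: int, value_dim: int) -> None:
        self.key_dim = key_dim
        self.value_dim = value_dim
        # value_dim x key_dim matrix; its size never depends on sequence length.
        self.weights = [[0.0] * key_dim for _ in range(value_dim)]

    def read(self, key: Vector) -> Vector:
        """Retrieve the value associated with `key` as a matrix-vector product."""
        return [sum(w * k for w, k in zip(row, key)) for row in self.weights]

    def write(self, key: Vector, value: Vector) -> None:
        """Delta-rule update (an assumption): replace what `key` currently
        retrieves with `value`, so rewriting a key overwrites its old value."""
        norm = sum(k * k for k in key)
        if norm == 0.0:
            return  # a zero key cannot address anything
        old = self.read(key)
        for i, row in enumerate(self.weights):
            delta = (value[i] - old[i]) / norm
            for j in range(self.key_dim):
                row[j] += delta * key[j]


def process_segments(
    segments: Sequence[Sequence[Tuple[Vector, Vector]]],
    memory: AssociativeMemory,
) -> AssociativeMemory:
    """Carry one memory across segments; each segment costs the same to process."""
    for segment in segments:
        for key, value in segment:
            memory.write(key, value)
    return memory


# Toy input: one-hot keys, with the first key rewritten in a later segment.
k1, k2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
segments = [
    [(k1, [1.0, 2.0]), (k2, [3.0, 4.0])],
    [(k1, [5.0, 6.0])],
]
memory = process_segments(segments, AssociativeMemory(key_dim=3, value_dim=2))
latest_for_k1 = memory.read(k1)
stored_for_k2 = memory.read(k2)
```

In this toy setup the keys are orthogonal, so retrieval is exact and rewriting `k1` in the second segment leaves the value stored under `k2` untouched. With non-orthogonal keys a linear memory of this kind returns blended values, which is one reason measuring memory capacity, as the paper's third contribution does, matters.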