Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture

Thomas F Burns,Tomoki Fukai,Christopher J Earls
2024-12-20
Abstract: Large language models (LLMs) demonstrate an impressive ability to utilise information within the context of their input sequences to appropriately respond to data unseen by the LLM during its training procedure. This ability is known as in-context learning (ICL). Humans and non-human animals demonstrate similar abilities, however their neural architectures differ substantially from LLMs. Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, widely used in and influenced by the computational neuroscience community to model biological memory systems. Using this connection, we introduce an associative memory model capable of performing ICL. We use this as inspiration for a novel residual stream architecture which allows information to directly flow between attention heads. We test this architecture during training within a two-layer Transformer and show its ICL abilities manifest more quickly than without this modification. We then apply our architecture in small language models with 8 million parameters, focusing on attention head values, with results also indicating improved ICL performance at this larger and more naturalistic scale.
Neural and Evolutionary Computing,Artificial Intelligence,Computation and Language
What problem does this paper attempt to address?
The paper asks how to improve the in-context learning (ICL) ability of large language models (LLMs) by introducing a new attention residual stream architecture. Specifically, the authors aim to:

1. **Draw on the associative memory model**: Use associative memory models, which are widely used in computational neuroscience to model biological memory systems, to improve ICL performance in Transformers.
2. **Propose a novel architecture**: Design a new residual stream architecture in which information flows directly between attention heads, so that ICL abilities emerge faster and perform better.
3. **Verify the effectiveness of the new architecture**: Test the architecture at two scales, first in a two-layer Transformer and then in small language models with 8 million parameters, the latter being a larger and more naturalistic setting.

### Main content of the paper

#### 1. Introduction

The paper first introduces the concept of in-context learning (ICL) and its importance. ICL is the ability of a model to use information in the context of its input sequence to respond appropriately to data not seen during training, an ability that humans and non-human animals also demonstrate. Although existing LLMs show strong ICL abilities, their neural architectures differ substantially from biological systems. However, the attention mechanism in LLMs resembles modern associative memory models, and this connection provides the inspiration for improving ICL.

#### 2. Associative memory model and ICL

The authors introduce an associative memory model named AMICL (Associative Memory for In-Context Learning), which can perform ICL tasks on the basis of a single-layer Transformer attention head. AMICL achieves ICL by converting input data into key, query, and value vectors and using these vectors for pattern matching and completion.

#### 3. Residual stream architecture

Inspired by the AMICL model, the authors propose a new residual stream architecture that allows information to flow directly between attention heads. The architecture was tested during training in a two-layer Transformer, where ICL abilities emerged more quickly than without the modification. The specific formulas are as follows:

\[ Q_2 = W_q^2 X + Q_1, \quad K_2 = W_k^2 X + K_1, \quad V_2 = W_v^2 X + V_1 \]

where \(Q_1, K_1, V_1\) are the queries, keys, and values of the first-layer attention head, \(Q_2, K_2, V_2\) are those of the second-layer attention head, and \(W_q^2, W_k^2, W_v^2\) are the weight matrices of the second layer (the superscript 2 is a layer index, not a power).

#### 4. Testing of small language models

To verify the effect of the new architecture at a larger scale, the authors tested it in small language models with approximately 8 million parameters, focusing on attention head values. The results show that the model with the residual stream architecture outperforms the unmodified model in both training loss and validation loss, and performs better on the indirect object identification (IOI) task.

### Conclusion

By building on the associative memory model, the authors designed a new residual stream architecture that improves the ICL ability of Transformer models. The improvement is shown in a two-layer Transformer and is also indicated in small language models with 8 million parameters, providing valuable insights for further research on neural network design and the understanding of biological cognition.
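
The residual stream formulas in Section 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `two_layer_forward`, the `stream` switch, the dimensions, and the random weights are all hypothetical. It also assumes that \(X\) in the second layer is the first layer's output plus the usual residual connection, uses a single head per layer, and omits the causal mask, MLP blocks, and normalisation. Tokens are stored as rows, so `x @ W` plays the role of \(W X\).

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention for one head (no causal mask, for brevity)."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def two_layer_forward(x, layer1, layer2, stream=("q", "k", "v")):
    """Two single-head attention layers with optional Q/K/V residual streams.

    x        : (tokens, d) input embeddings
    layer1/2 : dicts of (d, d) weight matrices under the keys "q", "k", "v"
    stream   : which of Q, K, V receive the first-layer term (hypothetical switch)
    """
    # Layer 1: ordinary query, key, and value projections.
    qkv1 = {name: x @ layer1[name] for name in "qkv"}
    h = x + attend(qkv1["q"], qkv1["k"], qkv1["v"])  # assumed layer-2 input X

    # Layer 2: projection of the layer input, plus the layer-1 counterpart
    # when that stream is enabled, e.g. V_2 = W_v^2 X + V_1.
    qkv2 = {}
    for name in "qkv":
        qkv2[name] = h @ layer2[name]
        if name in stream:
            qkv2[name] = qkv2[name] + qkv1[name]
    out = h + attend(qkv2["q"], qkv2["k"], qkv2["v"])
    return out, qkv1, qkv2

# Toy data: 5 tokens with 8-dimensional embeddings (illustrative sizes only).
rng = np.random.default_rng(0)
tokens, d = 5, 8
x = rng.normal(size=(tokens, d))
layer1 = {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in "qkv"}
layer2 = {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in "qkv"}

# Baseline without any stream, then the value stream only.
out_plain, _, qkv2_plain = two_layer_forward(x, layer1, layer2, stream=())
out_values, qkv1, qkv2 = two_layer_forward(x, layer1, layer2, stream=("v",))
```

Passing `stream=("v",)` enables only the value stream, which mirrors the abstract's note that the small language model experiments focus on attention head values. Passing `stream=()` recovers an unmodified two-layer attention stack for comparison.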