Abstract: Large language models (LLMs) demonstrate an impressive ability to utilise information within the context of their input sequences to appropriately respond to data unseen by the LLM during its training procedure. This ability is known as in-context learning (ICL). Humans and non-human animals demonstrate similar abilities; however, their neural architectures differ substantially from LLMs. Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, widely used in and influenced by the computational neuroscience community to model biological memory systems. Using this connection, we introduce an associative memory model capable of performing ICL. We use this as inspiration for a novel residual stream architecture which allows information to directly flow between attention heads. We test this architecture during training within a two-layer Transformer and show its ICL abilities manifest more quickly than without this modification. We then apply our architecture in small language models with 8 million parameters, focusing on attention head values, with results also indicating improved ICL performance at this larger and more naturalistic scale.

What problem does this paper attempt to address?

The problem this paper addresses is how to improve the in-context learning (ICL) ability of large language models (LLMs) by introducing a new attention residual stream architecture. Specifically, the authors aim to:

1. **Draw on the associative memory model**: Use the associative memory model, which is widely used in computational neuroscience to model biological memory systems, to improve ICL performance in the Transformer model.
2. **Propose a novel architecture**: Design a new residual stream architecture in which information flows directly between attention heads, so that ICL ability emerges faster and becomes stronger.
3. **Verify the effectiveness of the new architecture**: Test the architecture at two scales, a two-layer Transformer and small language models with 8 million parameters, and show that it also helps on the larger and more naturalistic data.

### Main content of the paper

#### 1. Introduction

The paper first introduces the concept of in-context learning (ICL) and its importance. ICL is the ability of a model to use information in its context to respond appropriately to inputs it did not see during training, similar to the cognitive abilities of humans and non-human animals. Although existing LLMs demonstrate strong ICL abilities, their neural architectures differ substantially from biological systems. However, the attention mechanism in LLMs resembles associative memory models, and this resemblance provides the inspiration for improving ICL.

#### 2. Associative memory model and ICL

The authors introduce an associative memory model named AMICL (Associative Memory for In-Context Learning), which performs ICL tasks on the basis of a single-layer Transformer attention head. AMICL achieves ICL by converting the input data into key, query, and value vectors and using these vectors for pattern matching and completion.

#### 3. Residual stream architecture

Inspired by the AMICL model, the authors propose a new residual stream architecture in which the queries, keys, and values of one attention head are passed directly to the attention head in the next layer. This architecture was tested in a two-layer Transformer model, where ICL ability emerged more quickly during training than without the modification. The specific formulas are as follows:

\[ Q_2 = W_q^2 X + Q_1, \quad K_2 = W_k^2 X + K_1, \quad V_2 = W_v^2 X + V_1 \]

where \(Q_1, K_1, V_1\) are the queries, keys, and values of the first-layer attention head, \(Q_2, K_2, V_2\) are those of the second-layer attention head, and \(W_q^2, W_k^2, W_v^2\) are the weight matrices of the second layer (the superscript 2 is a layer index, not a power).

#### 4. Testing of small language models

To verify the effect of the new architecture at a larger scale, the authors tested it on small language models with approximately 8 million parameters. The results show that the model with the residual stream architecture outperforms the standard model in both training loss and validation loss, and performs better on the indirect object identification (IOI) task.

### Conclusion

By drawing on the associative memory model, the authors designed a new residual stream architecture that improved the ICL ability of the Transformer model. The improvement was verified in small models and also shows potential at larger scale, providing insights for further research on neural network design and on the understanding of biological cognition.
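The residual stream formulas for \(Q_2, K_2, V_2\) in Section 3 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The function names (`project`, `attend`, `two_layer_heads`) are hypothetical, and several choices are assumptions made only for illustration: tokens are stored as rows, so projections are written `X W` rather than `W X`; the second-layer input `X` is taken to be the first-layer output plus the usual skip connection; and there is a single head per layer with no causal mask, MLP, or layer normalisation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v):
    """Scaled dot-product attention (assumption: no causal mask)."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def project(x, w, prev=None):
    """Return (Q, K, V) for input x and weights w = {'q', 'k', 'v'}.

    If prev = (Q1, K1, V1) is given, it is added to the new projections,
    mirroring Q2 = Wq X + Q1, K2 = Wk X + K1, V2 = Wv X + V1.
    """
    qkv = tuple(matmul(x, w[name]) for name in "qkv")
    if prev is not None:
        qkv = tuple(add(cur, old) for cur, old in zip(qkv, prev))
    return qkv


def two_layer_heads(x, w1, w2, residual=True):
    """Two stacked attention heads, optionally linked by a Q/K/V residual stream."""
    qkv1 = project(x, w1)
    h1 = add(x, attend(*qkv1))  # ordinary skip connection (assumed)
    qkv2 = project(h1, w2, prev=qkv1 if residual else None)
    return add(h1, attend(*qkv2))


def random_matrix(rows, cols, rng):
    """Small Gaussian matrix for toy inputs and weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    x = random_matrix(4, 3, rng)  # 4 tokens, embedding size 3
    w1 = {n: random_matrix(3, 3, rng) for n in "qkv"}
    w2 = {n: random_matrix(3, 3, rng) for n in "qkv"}
    out = two_layer_heads(x, w1, w2)
    print(len(out), len(out[0]))
```

Passing `residual=False` gives the unmodified two-layer baseline, which makes it easy to compare the two variants on the same weights. The abstract notes that the language-model experiments focus on attention head values; restricting the stream to values only would correspond to adding just `V1` in `project`.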

Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture

Decoding In-Context Learning: Neuroscience-inspired Analysis of Representations in Large Language Models

In-Context Language Learning: Architectures and Algorithms

Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability

Linking In-context Learning in Transformers to Human Episodic Memory

Interactive Continual Learning: Fast and Slow Thinking

Why Larger Language Models Do In-context Learning Differently?

Scaling In-Context Demonstrations with Structured Attention

Revisiting In-context Learning Inference Circuit in Large Language Models

In-Context Exemplars as Clues to Retrieving from Large Associative Memory

CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

Incremental Accumulation of Linguistic Context in Artificial and Biological Neural Networks

Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network

Just read twice: closing the recall gap for recurrent language models

From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

Augmenting Language Models with Long-Term Memory

Memorization in In-Context Learning

RecallM: An Adaptable Memory Mechanism with Temporal Understanding for Large Language Models