Abstract: State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of modern language models that enables task execution without parameter optimization, remain underexplored compared to Transformers. In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks. Our results show that SSMs perform comparably to Transformers in standard regression ICL tasks, while outperforming them in tasks like sparse parity learning. However, SSMs fall short in tasks involving non-standard retrieval functionality. To address these limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with attention blocks, surpassing individual models in tasks where they struggle independently. Our findings suggest that hybrid architectures offer promising avenues for enhancing ICL in language models.

What problem does this paper attempt to address?

This paper asks whether architectures that do not rely on attention can perform in-context learning (ICL). Specifically, it studies how state-space models (SSMs), especially Mamba, perform on ICL tasks. Although SSMs have shown performance comparable to Transformer networks in language modeling, their ICL capabilities have not been fully studied. The paper compares Mamba and Transformer models across a range of ICL tasks and proposes a new hybrid architecture, MambaFormer, that combines the strengths of both to improve ICL performance.

### Main problems

1. **ICL capabilities of SSMs**:
   - The paper explores whether SSMs, especially Mamba, can carry out tasks without parameter optimization, that is, whether they have ICL capabilities.
   - Through a series of experiments, the paper evaluates Mamba on different ICL tasks and compares it with the Transformer model.
2. **Limitations of Mamba**:
   - Mamba performs well on some tasks, such as sparse parity learning, but has limitations on others, such as decision-tree learning and non-standard retrieval.
3. **Proposal of the hybrid model**:
   - To overcome the limitations of Mamba, the paper proposes MambaFormer, a hybrid architecture that combines Mamba with multi-head attention.
   - MambaFormer performs well on all evaluated ICL tasks, especially on tasks where Mamba or the Transformer performs poorly on its own.

### Experimental design

- **Task types**: The paper uses a suite of ICL tasks, including linear regression, sparse linear regression, two-layer neural network regression, decision tree, orthogonal-outlier regression, multi-outlier regression, sparse parity learning, chain-of-thought input-output, and vector-valued multi-query associative recall.
- **Model training**: Each model is trained from scratch on randomly generated in-context learning prompts.
- **Performance evaluation**: ICL performance is measured as the model's average loss on test prompts.

### Experimental results

- **Performance of Mamba**:
  - Mamba performs well on simple tasks such as linear regression, comparable to the Transformer.
  - On complex tasks such as sparse parity learning, Mamba significantly outperforms the Transformer.
  - However, Mamba performs poorly on decision-tree learning and vector-valued multi-query associative recall.
- **Advantages of MambaFormer**:
  - MambaFormer performs well on all evaluated ICL tasks, especially on tasks where Mamba or the Transformer performs poorly on its own.
  - The hybrid architecture achieves the best performance by combining the efficiency of Mamba with the expressive power of the Transformer.

### Conclusion

The paper shows experimentally that, although Mamba has limitations on some ICL tasks, the hybrid architecture MambaFormer effectively improves performance on those tasks. This suggests that combining the strengths of different model types is an effective way to improve ICL capabilities.
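To make the experimental design described above more concrete, here is a minimal sketch of how randomly generated ICL prompts and the average test loss could be set up for two of the listed tasks, linear regression and sparse parity. It is an illustration only, not the authors' code: the function names, the Gaussian sampling of inputs and weights, the ±1 encoding of parity, the squared-error loss, and the trivial `zero_predictor` (a stand-in for a trained Mamba, Transformer, or MambaFormer model) are all assumptions introduced here.

```python
import random


def make_linear_regression_prompt(dim, n_examples, rng):
    """Sample one prompt for a hidden linear function y = w . x (assumed setup)."""
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # hidden task weights

    def sample_pair():
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        return x, sum(wi * xi for wi, xi in zip(w, x))

    examples = [sample_pair() for _ in range(n_examples)]  # in-context pairs
    query_x, query_y = sample_pair()  # held-out query and its target
    return examples, query_x, query_y


def make_sparse_parity_prompt(dim, k, n_examples, rng):
    """Sample one prompt whose label is the parity of k hidden coordinates."""
    subset = rng.sample(range(dim), k)  # hidden relevant coordinates

    def sample_pair():
        x = [rng.choice((-1, 1)) for _ in range(dim)]
        y = 1
        for j in subset:
            y *= x[j]  # a product of +/-1 entries encodes parity
        return x, y

    examples = [sample_pair() for _ in range(n_examples)]
    query_x, query_y = sample_pair()
    return examples, query_x, query_y


def average_loss(predict, make_prompt, n_prompts, rng):
    """Average squared error of `predict` on the query of each test prompt."""
    total = 0.0
    for _ in range(n_prompts):
        examples, query_x, query_y = make_prompt(rng)
        total += (predict(examples, query_x) - query_y) ** 2
    return total / n_prompts


def zero_predictor(examples, query_x):
    """Hypothetical placeholder model: ignores the context and predicts 0."""
    return 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    parity_loss = average_loss(
        zero_predictor, lambda r: make_sparse_parity_prompt(10, 2, 20, r), 200, rng
    )
    regression_loss = average_loss(
        zero_predictor, lambda r: make_linear_regression_prompt(5, 20, r), 200, rng
    )
    print("sparse parity:", parity_loss)
    print("linear regression:", regression_loss)
```

Because parity labels are always ±1, the placeholder predictor's squared error on sparse parity is exactly 1.0, which gives a reference point: a model that has learned the task in context should score well below it. Swapping `zero_predictor` for a trained sequence model that reads the examples and the query is where the architectures compared in the paper would plug in.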

Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

Is Mamba Capable of In-Context Learning?

Mamba State-Space Models Are Lyapunov-Stable Learners

An Empirical Study of Mamba-based Language Models

Learning Mamba as a Continual Learner

Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning

Can Custom Models Learn In-Context? An Exploration of Hybrid Architecture Performance on In-Context Learning Tasks

ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Revealing and Mitigating the Local Pattern Shortcuts of Mamba

State Space Models are Strong Text Rerankers

A Survey of Mamba

Exploring the Capability of Mamba in Speech Applications

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models

Taipan: Efficient and Expressive State Space Language Models with Selective Attention

ReMamba: Equip Mamba with Effective Long-Sequence Modeling

SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series

The Hidden Attention of Mamba Models

Sparse Mamba: Introducing Controllability, Observability, And Stability To Structural State Space Models