Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos
2024-04-26
Abstract: State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of modern language models that enables task execution without parameter optimization, remain underexplored compared to Transformers. In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks. Our results show that SSMs perform comparably to Transformers in standard regression ICL tasks, while outperforming them in tasks like sparse parity learning. However, SSMs fall short in tasks involving non-standard retrieval functionality. To address these limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with attention blocks, surpassing individual models in tasks where they struggle independently. Our findings suggest that hybrid architectures offer promising avenues for enhancing ICL in language models.
Machine Learning
What problem does this paper attempt to address?
This paper asks whether models that do not rely on the attention mechanism can perform in-context learning (ICL). It focuses on state-space models (SSMs), especially Mamba. Although SSMs have shown language-modeling performance comparable to Transformer networks, their ICL capabilities have not been fully studied. The paper compares Mamba and Transformer models across a range of ICL tasks and proposes a hybrid architecture, MambaFormer, that combines the strengths of both to improve ICL performance.

### Main problems

1. **ICL capabilities of SSMs**:
   - The paper explores whether SSMs, especially Mamba, can perform tasks without parameter optimization, that is, whether they have ICL capabilities.
   - Through a series of experiments, it evaluates Mamba on different ICL tasks and compares it with the Transformer.
2. **Limitations of Mamba**:
   - Mamba performs well on some tasks, such as sparse parity learning, but has limitations on others, such as decision-tree learning and non-standard retrieval.
3. **Proposal of the hybrid model**:
   - To overcome these limitations, the paper proposes MambaFormer, a hybrid architecture that combines Mamba blocks with multi-head attention.
   - MambaFormer performs well on all evaluated ICL tasks, especially those where Mamba or the Transformer alone performs poorly.

### Experimental design

- **Task types**: The paper uses a suite of ICL tasks: linear regression, sparse linear regression, two-layer neural network regression, decision-tree learning, orthogonal-outlier regression, many-outlier regression, sparse parity learning, chain-of-thought input-output, and vector-valued multi-query associative recall (MQAR).
- **Model training**: Each model is trained from scratch on randomly generated in-context learning prompts.
- **Performance evaluation**: ICL performance is measured as the model's average loss on test prompts.

### Experimental results

- **Performance of Mamba**:
  - Mamba performs well on simple tasks such as linear regression, comparable to the Transformer.
  - On sparse parity learning, Mamba significantly outperforms the Transformer.
  - However, Mamba performs poorly on decision-tree learning and on vector-valued multi-query associative recall.
- **Advantages of MambaFormer**:
  - MambaFormer performs well on all evaluated ICL tasks, especially those where Mamba or the Transformer alone performs poorly.
  - The hybrid architecture achieves the best performance by combining Mamba's efficiency with the Transformer's strong expressive power.

### Conclusion

The paper shows experimentally that, although Mamba has limitations on some ICL tasks, the hybrid architecture MambaFormer effectively improves performance on them. This suggests that combining the strengths of different model types is an effective way to improve ICL capabilities.
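
To make the experimental design above more concrete, the following is a minimal sketch of how randomly generated prompts for two of the listed tasks (linear regression and sparse parity) might be sampled, and how an average loss over test prompts could be computed. It is not the authors' code. The function names, the prompt sizes (`n_examples`, `dim`, `k`), the squared-error loss, and the least-squares predictor standing in for a trained sequence model are all assumptions made for illustration.

```python
import numpy as np

def sample_linear_regression_prompt(rng, n_examples=20, dim=5):
    """One prompt: in-context pairs (x_i, y_i) plus a held-out query, with y = <w, x>.
    The sizes are illustrative assumptions, not the paper's settings."""
    w = rng.standard_normal(dim)                      # hidden task vector, resampled per prompt
    xs = rng.standard_normal((n_examples + 1, dim))   # last row is the query
    ys = xs @ w
    return xs[:-1], ys[:-1], xs[-1], ys[-1]

def sample_sparse_parity_prompt(rng, n_examples=20, dim=10, k=2):
    """Sparse parity: x is in {-1, +1}^dim and the label is the product of k hidden coordinates."""
    support = rng.choice(dim, size=k, replace=False)  # hidden relevant coordinates
    xs = rng.choice([-1.0, 1.0], size=(n_examples + 1, dim))
    ys = np.prod(xs[:, support], axis=1)
    return xs[:-1], ys[:-1], xs[-1], ys[-1]

def least_squares_predictor(ctx_x, ctx_y, query_x):
    """Hypothetical stand-in for a trained model: fit least squares on the
    in-context examples only, then predict the query label."""
    w_hat, *_ = np.linalg.lstsq(ctx_x, ctx_y, rcond=None)
    return float(query_x @ w_hat)

def average_query_loss(predictor, sampler, rng, n_prompts=200):
    """Average squared error on the query over freshly sampled test prompts."""
    losses = []
    for _ in range(n_prompts):
        ctx_x, ctx_y, query_x, query_y = sampler(rng)
        losses.append((predictor(ctx_x, ctx_y, query_x) - query_y) ** 2)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
linreg_loss = average_query_loss(least_squares_predictor, sample_linear_regression_prompt, rng)
parity_loss = average_query_loss(least_squares_predictor, sample_sparse_parity_prompt, rng)
print(f"linear regression: {linreg_loss:.3e}, sparse parity: {parity_loss:.3f}")
```

In this sketch the linear baseline solves the noiseless regression prompts essentially exactly but cannot fit parity labels, which illustrates why the two tasks probe different capabilities. Evaluating an actual Mamba, Transformer, or MambaFormer model would replace `least_squares_predictor` with a forward pass over the serialized prompt.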