DeciMamba: Exploring the Length Extrapolation Potential of Mamba

Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, Raja Giryes
2024-06-21
Abstract: Long-range sequence processing poses a significant challenge for Transformers due to their quadratic complexity in input length. A promising alternative is Mamba, which demonstrates high performance and achieves Transformer-level capabilities while requiring substantially fewer computational resources. In this paper we explore the length-generalization capabilities of Mamba, which we find to be relatively limited. Through a series of visualizations and analyses we identify that the limitations arise from a restricted effective receptive field, dictated by the sequence length used during training. To address this constraint, we introduce DeciMamba, a context-extension method specifically designed for Mamba. This mechanism, built on top of a hidden filtering mechanism embedded within the S6 layer, enables the trained model to extrapolate well even without additional training. Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths that are 25x longer than the ones seen during training, and does so without utilizing additional computational resources. We will release our code and models.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines the limitations of the Mamba model on long sequences and proposes a method called DeciMamba to improve its length generalization. Mamba is an attention-free network with sub-quadratic complexity that performs comparably to Transformers on multiple tasks but is limited on long sequences. The study finds that Mamba's effective receptive field (ERF) is bounded by the sequence length used in training, so information does not propagate far enough when the input exceeds that length.
To address this, the paper introduces DeciMamba, a context-extension method designed specifically for Mamba. DeciMamba builds on a hidden filtering mechanism inside the Mamba layer: it expands the ERF with a dynamic, data-dependent pooling method that discards unimportant tokens. This lets the model handle sequences up to 25 times longer than the training length without additional computational resources.
Experiments on real-world long-range NLP tasks show significant length extrapolation. In document retrieval and multi-document question answering, DeciMamba outperforms the original Mamba model on longer input sequences. It also performs well in language modeling on the PG-19 dataset, extrapolating to longer sequences at lower computational cost. In short, the paper identifies why Mamba struggles with long sequences and proposes DeciMamba as an effective way to extend it to long contexts, which matters for real-world applications that process large amounts of data.
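The token-discarding idea in this summary can be illustrated with a minimal sketch: given one importance score per token, keep only the highest-scoring tokens and pass the shortened sequence on in its original order. This is not the authors' implementation. The function name `decimate_tokens`, the `keep` parameter, and the example scores are hypothetical; the summary does not say how DeciMamba scores tokens, so the numbers below simply stand in for whatever signal the layer's filtering mechanism provides.

```python
from typing import List, Sequence, Tuple


def decimate_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    keep: int,
) -> Tuple[List[str], List[int]]:
    """Keep the `keep` highest-scoring tokens, preserving sequence order.

    `scores` is an assumed per-token importance signal (hypothetical here).
    Returns the kept tokens and their original positions.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep < 0:
        raise ValueError("keep must be non-negative")
    if keep >= len(tokens):
        # Nothing to discard: the sequence already fits the budget.
        return list(tokens), list(range(len(tokens)))

    # Rank positions by score, highest first; ties go to the earlier position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore the original order so the kept tokens still form a sequence.
    kept_positions = sorted(ranked[:keep])
    return [tokens[i] for i in kept_positions], kept_positions


# Made-up example: a 7-token input reduced to a 3-token budget.
tokens = ["the", "key", "is", "42", "and", "so", "on"]
scores = [0.1, 0.9, 0.2, 0.8, 0.1, 0.05, 0.3]
kept, positions = decimate_tokens(tokens, scores, keep=3)
print(kept, positions)
```

Because the number of kept tokens is capped by `keep` rather than by the input length, later computation sees a sequence no longer than the budget, which is the intuition behind extending context without extra compute.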