DeciMamba: Exploring the Length Extrapolation Potential of Mamba

Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, Raja Giryes

2024-06-21

Abstract: Long-range sequence processing poses a significant challenge for Transformers due to their quadratic complexity in input length. A promising alternative is Mamba, which demonstrates high performance and achieves Transformer-level capabilities while requiring substantially fewer computational resources. In this paper we explore the length-generalization capabilities of Mamba, which we find to be relatively limited. Through a series of visualizations and analyses we identify that the limitations arise from a restricted effective receptive field, dictated by the sequence length used during training. To address this constraint, we introduce DeciMamba, a context-extension method specifically designed for Mamba. This mechanism, built on top of a hidden filtering mechanism embedded within the S6 layer, enables the trained model to extrapolate well even without additional training. Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths that are 25x longer than the ones seen during training, and does so without utilizing additional computational resources. We will release our code and models.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

This paper examines the limitations of the Mamba model on long sequences and proposes a method called DeciMamba to improve its length generalization. Mamba is an attention-free network with sub-quadratic complexity that performs comparably to Transformers on many tasks but struggles with inputs longer than those seen during training. The study finds that Mamba's effective receptive field (ERF) is bounded by the training sequence length, so information does not propagate far enough when the input exceeds that length.

To address this issue, the paper introduces DeciMamba, a context-extension method designed specifically for Mamba. DeciMamba uses a hidden filtering mechanism inside the Mamba layer to expand the ERF through a dynamic, data-dependent pooling method that discards unimportant tokens. This enables the model to handle sequences up to 25 times longer than the training length without additional computational resources.

Experimental results show that DeciMamba extrapolates well to longer inputs on real-world long-range NLP tasks. In document retrieval and multi-document question answering, for example, DeciMamba outperforms the original Mamba model on longer input sequences. DeciMamba also performs well in language modeling, particularly on the PG-19 dataset, where it extrapolates to longer sequences at lower computational cost. In conclusion, the paper identifies the limitations of Mamba on long sequences and proposes DeciMamba as an effective way to extend its ability to handle long contexts, which matters for real-world applications that process large amounts of data.
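The token-discarding idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how token importance is computed, so the per-token `scores` here are a hypothetical input standing in for whatever signal the Mamba layer's filtering mechanism provides, and the function name `decimate_tokens` and the fixed `keep` budget are likewise assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def decimate_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    keep: int,
) -> Tuple[List[str], List[int]]:
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    `scores` is a hypothetical per-token importance signal (an assumption);
    higher means more important.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep < 0:
        raise ValueError("keep must be non-negative")
    if keep >= len(tokens):
        # Nothing to discard: the sequence already fits the budget.
        return list(tokens), list(range(len(tokens)))

    # Rank positions by score, highest first; ties favor earlier positions.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore sequence order so the shortened input is still causal.
    kept_positions = sorted(ranked[:keep])
    return [tokens[i] for i in kept_positions], kept_positions


# Toy example with made-up scores.
tokens = ["the", "key", "is", "42", "and", "so", "on", "."]
scores = [0.1, 0.9, 0.2, 0.8, 0.1, 0.05, 0.05, 0.3]
kept_tokens, kept_positions = decimate_tokens(tokens, scores, keep=3)
print(kept_tokens)
print(kept_positions)
```

Shortening the sequence this way means the layers that follow see fewer tokens, so a receptive field sized for the training length can span a longer original input. How the method chooses where and how aggressively to pool is not covered by the summary above.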

ReMamba: Equip Mamba with Effective Long-Sequence Modeling

PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training

An Empirical Study of Mamba-based Language Models

Can Mamba Always Enjoy the "Free Lunch"?

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data

Mamba Retriever: Utilizing Mamba for Effective and Efficient Dense Retrieval

A Survey of Mamba

Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models

MambaDepth: Enhancing Long-range Dependency for Self-Supervised Fine-Structured Monocular Depth Estimation

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation

VideoMambaPro: A Leap Forward for Mamba in Video Understanding

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

ChiMamba: Predicting Chromatin Interactions Based on Mamba

Integrating Mamba and Transformer for Long-Short Range Time Series Forecasting

BlackMamba: Mixture of Experts for State-Space Models

MatMamba: A Matryoshka State Space Model