Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data

Shufan Li, Harkanwar Singh, Aditya Grover
2024-07-14
Abstract: In recent years, Transformers have become the de-facto architecture for sequence modeling on text and a variety of multi-dimensional data, such as images and video. However, the use of self-attention layers in a Transformer incurs prohibitive compute and memory complexity that scales quadratically w.r.t. the sequence length. A recent architecture, Mamba, based on state space models has been shown to achieve comparable performance for modeling text sequences, while scaling linearly with the sequence length. In this work, we present Mamba-ND, a generalized design extending the Mamba architecture to arbitrary multi-dimensional data. Our design alternatively unravels the input data across different dimensions following row-major orderings. We provide a systematic comparison of Mamba-ND with several other alternatives, based on prior multi-dimensional extensions such as Bi-directional LSTMs and S4ND. Empirically, we show that Mamba-ND demonstrates performance competitive with the state-of-the-art on a variety of multi-dimensional benchmarks, including ImageNet-1K classification, HMDB-51 action recognition, and ERA5 weather forecasting.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the following problems:

1. **Computational complexity in multidimensional data modeling**:
   - When Transformers process multidimensional data such as images and video, the self-attention mechanism incurs compute and memory costs that grow quadratically with the sequence length. This makes it hard to scale the model to longer sequences.
   - The earlier Mamba architecture achieves linear complexity through State Space Models (SSMs), but it was demonstrated mainly on 1D text sequences. How to extend it to multidimensional data remained an open question.
2. **Effective methods for processing multidimensional data**:
   - The paper proposes a new design, Mamba-ND, which processes multidimensional data by alternately unraveling the input along different dimensions in row-major order. This achieves performance comparable to or better than existing Transformer models while keeping a lower parameter count and linear complexity.
3. **Comparative study of different design choices**:
   - The authors run extensive ablation experiments on several candidate designs, including a bidirectional design (Bi-SSM), a multidirectional design (ND-SSM), and a multi-head design (Multi-Head-SSM). They find that the alternating-direction design is the simplest and most effective.

Through these studies, the authors aim to provide a general and efficient framework for a range of multidimensional data tasks, including image classification, action recognition, weather forecasting, and 3D segmentation.
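
The idea of alternately unraveling the input along different dimensions can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names (`flatten_along`, `restore`, `alternating_stack`) are hypothetical, the exact ordering schedule is an assumption, and `toy_sequence_layer` is a simple causal running mean standing in for a real Mamba block. The sketch only shows how each layer can see the same N-dimensional array as a 1D sequence in a different row-major order.

```python
import numpy as np

def flatten_along(x, axis, reverse=False):
    """Flatten x to 1D in row-major order, with `axis` varying fastest."""
    moved = np.moveaxis(x, axis, -1)   # put the chosen axis last
    seq = moved.reshape(-1)            # row-major (C-order) flattening
    if reverse:
        seq = seq[::-1]                # scan the same ordering backwards
    return seq, moved.shape

def restore(seq, moved_shape, axis, reverse=False):
    """Invert flatten_along: rebuild the original N-dimensional layout."""
    if reverse:
        seq = seq[::-1]
    return np.moveaxis(seq.reshape(moved_shape), -1, axis)

def toy_sequence_layer(seq):
    """Stand-in for a 1D sequence layer (NOT Mamba): causal running mean."""
    return np.cumsum(seq) / np.arange(1, seq.size + 1)

def alternating_stack(x, num_layers):
    """Apply 1D layers, switching the scan order from layer to layer.

    The schedule (each axis forward, then backward) is an assumption
    made for illustration.
    """
    orders = [(axis, rev) for axis in range(x.ndim) for rev in (False, True)]
    for i in range(num_layers):
        axis, rev = orders[i % len(orders)]
        seq, shape = flatten_along(x, axis, rev)
        x = restore(toy_sequence_layer(seq), shape, axis, rev)
    return x

image = np.arange(6).reshape(2, 3)          # toy 2 x 3 "image"
print(flatten_along(image, axis=1)[0])      # [0 1 2 3 4 5]
print(flatten_along(image, axis=0)[0])      # [0 3 1 4 2 5]
out = alternating_stack(image.astype(float), num_layers=4)
print(out.shape)                            # (2, 3)
```

Because every layer works on a flat sequence of length equal to the number of elements, the per-layer cost of a linear-time sequence model stays linear in the input size, while the changing scan order lets information propagate along each dimension in turn.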