SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series

Badri N. Patro, Vijay S. Agneeswaran
2024-04-25
Abstract: Transformers have widely adopted attention networks for sequence mixing and MLPs for channel mixing, playing a pivotal role in achieving breakthroughs across domains. However, recent literature highlights issues with attention networks, including low inductive bias and quadratic complexity with respect to input sequence length. State Space Models (SSMs) like S4 and others (HiPPO, Global Convolutions, Liquid S4, LRU, Mega, and Mamba) have emerged to address these issues and help handle longer sequence lengths. Mamba, while being the state-of-the-art SSM, has a stability issue when scaled to large networks for computer vision datasets. We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations and uses the Mamba block for sequence modeling. Extensive performance studies across image and time-series benchmarks demonstrate that SiMBA outperforms existing SSMs, bridging the performance gap with state-of-the-art transformers. Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet, on transfer learning benchmarks such as Stanford Car and Flower, on task learning benchmarks, and on seven time series benchmark datasets. The project page is available on this website ~\url{
Computer Vision and Pattern Recognition, Machine Learning, Image and Video Processing, Systems and Control
What problem does this paper attempt to address?
The paper primarily aims to address the following issues:

1. **Stability Issues**: The existing Mamba architecture becomes unstable when scaled to large networks (e.g., on the ImageNet dataset): the training loss fails to converge, with vanishing or exploding gradients.
2. **Performance Gap**: Although state space models (SSMs) like Mamba have advantages in handling long sequences, their performance on computer vision tasks still lags behind state-of-the-art attention-based transformer models.

To tackle these challenges, the paper proposes SiMBA (Simplified Mamba-based Architecture), a new architecture that combines Mamba sequence modeling with a new channel mixing technique called EinFFT. Specifically, the key contributions of SiMBA include:

- **EinFFT**: A novel channel mixing technique that manipulates the frequency components of features using the Fourier transform and ensures that all eigenvalues of the SSM state matrix A are negative real numbers, thereby addressing the stability issues of Mamba. The technique applies not only to image data but also to other data modalities such as time series.
- **Optimized Mamba Architecture**: SiMBA proposes an optimized version of the Mamba architecture for computer vision tasks, using EinFFT for channel mixing and employing residual connections and dropout to further enhance stability.
- **Performance Improvement**: SiMBA narrows the performance gap between state space models and advanced attention-based transformer models, with strong results on the ImageNet dataset and seven standard time series datasets.

The experimental section reports SiMBA's performance at different model scales. For small-scale models in particular, SiMBA surpasses many advanced convolutional neural networks and transformer models in Top-1 accuracy while remaining competitive in parameter count and computational complexity.
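To see why negative real eigenvalues matter for stability, the following minimal sketch iterates a single diagonal entry of a linear state update with no input. It assumes a zero-order-hold style discretization, where a continuous eigenvalue `lambda` becomes `exp(step * lambda)`. The function names, step size, and number of steps are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def discrete_eigenvalue(continuous_eigenvalue, step_size):
    """Assumed zero-order-hold discretization of one diagonal entry of A."""
    return math.exp(step_size * continuous_eigenvalue)

def final_state_magnitude(continuous_eigenvalue, step_size=0.1, n_steps=200, h0=1.0):
    """Iterate h <- a_bar * h (no input) and return |h| after n_steps."""
    a_bar = discrete_eigenvalue(continuous_eigenvalue, step_size)
    h = h0
    for _ in range(n_steps):
        h = a_bar * h
    return abs(h)

decaying = final_state_magnitude(-1.0)  # negative real eigenvalue: state shrinks
growing = final_state_magnitude(+1.0)   # positive real eigenvalue: state blows up
print(decaying < 1.0, growing > 1.0)    # prints: True True
```

With a negative eigenvalue the discrete factor is below 1, so repeated application keeps the state bounded. With a positive eigenvalue the factor exceeds 1 and the state grows exponentially, which is the kind of behaviour that can destabilise training in deep stacks.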
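The idea of mixing channels in the frequency domain can be sketched as follows. This is a loose illustration and not the authors' EinFFT: the summary above does not specify the axis of the transform, the weight structure, or the nonlinearity, so all three are assumptions here. The sketch applies a naive DFT along the token axis, mixes channels with a hypothetical complex weight matrix shared across frequencies, applies a ReLU to the real and imaginary parts separately, and transforms back, keeping only the real part.

```python
import cmath

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse) of a sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    scale = 1 / n if inverse else 1
    return [
        scale * sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t, v in enumerate(values))
        for k in range(n)
    ]

def split_relu(z):
    """ReLU on real and imaginary parts separately (an assumed activation)."""
    return complex(max(z.real, 0.0), max(z.imag, 0.0))

def frequency_channel_mix(x, weights, activation=split_relu):
    """x: N tokens, each a list of D channels. weights: D_out x D complex matrix."""
    n_tokens, n_channels = len(x), len(x[0])
    # 1) DFT along the token axis, one channel at a time: spectrum[channel][freq].
    spectrum = [dft([x[t][c] for t in range(n_tokens)]) for c in range(n_channels)]
    # 2) Mix channels at every frequency, then apply the activation.
    mixed = [
        [activation(sum(weights[d][c] * spectrum[c][k] for c in range(n_channels)))
         for k in range(n_tokens)]
        for d in range(len(weights))
    ]
    # 3) Inverse DFT per output channel; keep the real part (a simplifying choice).
    signals = [dft(row, inverse=True) for row in mixed]
    return [[signals[d][t].real for d in range(len(weights))] for t in range(n_tokens)]

x = [[1.0, 0.0], [2.0, 1.0], [0.0, -1.0], [3.0, 2.0]]  # 4 tokens, 2 channels
identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]
round_trip = frequency_channel_mix(x, identity, activation=lambda z: z)
swapped = frequency_channel_mix(x, swap, activation=lambda z: z)
mixed = frequency_channel_mix(x, [[0.5, 0.5j], [-0.5j, 0.5]])
```

With identity weights and no activation the function returns its input, which is a quick sanity check on the transform pair. Without a nonlinearity, a weight matrix shared across frequencies is equivalent to ordinary channel mixing in the token domain, because the DFT is linear. The activation in the frequency domain is what makes the two differ.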
Additionally, SiMBA is evaluated on time series forecasting tasks, which shows that it can handle more than one type of data. The specific forecasting results are not reproduced in this summary. From the available information, SiMBA achieves better MSE and MAE than the compared models on multiple time series datasets.
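For reference, the two forecasting metrics mentioned above are the mean squared error (MSE) and the mean absolute error (MAE). The sketch below computes both for a flat list of values. The function names and the sample numbers are illustrative only and are not results from the paper.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences; penalises large errors more heavily."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average of absolute differences; less sensitive to outliers than MSE."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 2.0, 3.0]       # made-up ground-truth values
predicted = [1.0, 2.0, 5.0]    # made-up forecasts
mse = mean_squared_error(actual, predicted)   # (0 + 0 + 4) / 3
mae = mean_absolute_error(actual, predicted)  # (0 + 0 + 2) / 3
```

Lower is better for both metrics. For multivariate series, the same averages are typically taken over all variables and all forecast steps.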