Abstract: Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and shows improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available at https://github.com/microsoft/Samba.
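
To make the contrast between quadratic and linear cost concrete, the short sketch below counts how many query-key pairs are scored by full causal attention versus sliding window attention. It is only an illustration of the scaling argument in the abstract: the function names are hypothetical, and the window size of 2048 is an assumed value chosen for the example, not a figure taken from the text above.

```python
def full_causal_pairs(n: int) -> int:
    """Query-key pairs scored by full causal attention over n tokens."""
    # Token i (0-based) attends to tokens 0..i, i.e. i + 1 keys.
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Pairs scored when each token sees itself and at most window - 1 predecessors."""
    return sum(min(i + 1, window) for i in range(n))


if __name__ == "__main__":
    window = 2048  # assumed window size, for illustration only
    for n in (4_096, 131_072):
        full = full_causal_pairs(n)
        local = sliding_window_pairs(n, window)
        print(f"n={n}: full={full}, windowed={local}, ratio={full / local:.1f}")
```

Once the sequence is longer than the window, every additional token adds a fixed number of pairs, so the windowed count grows linearly with length while the full count grows quadratically.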

What problem does this paper attempt to address?

This paper proposes Samba, a model for efficiently handling sequences with unbounded context length. Earlier approaches either incur quadratic computational cost in the sequence length or generalize poorly to sequences longer than those seen in training. Samba is a hybrid architecture that combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA) layer by layer. The SSM selectively compresses the sequence into recurrent hidden states, while the attention layers preserve the ability to recall recent content precisely.

In large-scale experiments, Samba substantially outperforms state-of-the-art models based on pure attention or pure SSMs across a range of benchmarks, and it runs in time linear in the sequence length. The model is trained at several sizes, up to 3.8 billion parameters, and performs well on tasks including language understanding, commonsense reasoning, and mathematical problems. Trained on 4K-length sequences, Samba extrapolates to 256K context length with perfect memory recall and shows improved token predictions up to 1M context length.

The paper also explores different hybridization strategies, including ways of interleaving Mamba, SWA, and Multilayer Perceptron (MLP) layers, and compares alternative linear recurrent models such as Sliding RetNet and Gated Linear Attention (GLA). Samba performs best across the evaluated tasks, especially on long-context comprehension and code generation.

In conclusion, Samba addresses the challenge of modeling sequences with unbounded context by combining the strengths of SSMs and attention, improving both the quality and the efficiency of the model.
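
The layer-wise combination described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' implementation: `gated_recurrence` is a simplified stand-in for a selective SSM (the gates are passed in rather than computed from the input), the attention has no learned query, key, or value projections, and the MLP layers and normalization are omitted. All function names are hypothetical.

```python
import math


def gated_recurrence(xs, gates):
    """Toy stand-in for a selective SSM: h_t = g_t * h_{t-1} + (1 - g_t) * x_t."""
    h = [0.0] * len(xs[0])
    out = []
    for x, g in zip(xs, gates):
        # A gate near 1 keeps the old state; a gate near 0 overwrites it with x.
        h = [g * hi + (1.0 - g) * xi for hi, xi in zip(h, x)]
        out.append(h)
    return out


def sliding_window_attention(xs, window):
    """Causal softmax attention where token t sees only tokens t-window+1 .. t."""
    scale = 1.0 / math.sqrt(len(xs[0]))
    out = []
    for t, q in enumerate(xs):
        keys = xs[max(0, t - window + 1): t + 1]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * k[j] for w, k in zip(weights, keys)) / total
                    for j in range(len(q))])
    return out


def hybrid_block(xs, gates, window):
    """Recurrence layer followed by windowed attention, each with a residual add."""
    rec = gated_recurrence(xs, gates)
    h = [[a + b for a, b in zip(x, r)] for x, r in zip(xs, rec)]
    att = sliding_window_attention(h, window)
    return [[a + b for a, b in zip(x, r)] for x, r in zip(h, att)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(hybrid_block(tokens, gates=[0.5, 0.5, 0.5], window=2))
```

The recurrence carries a fixed-size summary of everything seen so far, while the attention layer looks back only `window` tokens. Per-token work is therefore bounded, which is the intuition behind the linear-time claim.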

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
