Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
2024-06-12
Abstract: Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and show improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available at https://github.com/microsoft/Samba.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper proposes Samba, a model for efficiently modeling sequences with unbounded context length. Prior approaches either incur quadratic computational complexity or generalize poorly to sequence lengths beyond those seen in training. Samba is a hybrid architecture that combines, layer by layer, Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). It selectively compresses the input sequence into recurrent hidden states while retaining the ability to recall memories precisely through the attention mechanism.
Samba is trained at several scales, up to 3.8 billion parameters on 3.2T tokens, and substantially outperforms state-of-the-art models based on pure attention or pure SSMs across a range of benchmarks, including language understanding, commonsense reasoning, and mathematical problems. Because it is a linear-time sequence model, it also handles long sequences more efficiently: the paper reports 3.73x higher throughput than Transformers with grouped-query attention on 128K-length prompts, and a 3.64x speedup when generating 64K tokens. When trained on 4K-length sequences, Samba can be efficiently extrapolated to a 256K context length with perfect memory recall, and it shows improved token predictions up to a 1M context length.
The paper also explores different hybridization strategies, including interleaving Mamba, SWA, and Multi-Layer Perceptron (MLP) layers, and compares alternative linear recurrent models such as Sliding RetNet and Gated Linear Attention (GLA). In these comparisons Samba performs best across various tasks, especially long-context comprehension and code generation.
In conclusion, Samba addresses the challenge of modeling sequences with unbounded context by combining the strengths of SSMs and attention mechanisms, improving both the performance and the efficiency of the model.
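As a rough illustration of why the Sliding Window Attention component keeps attention cost linear in sequence length, the sketch below builds a causal sliding-window mask and counts the query-key pairs it allows. It is a toy in plain Python, not code from the Samba repository: the function names (`sliding_window_mask`, `attended_pairs`, `full_causal_pairs`) and the window sizes are assumptions chosen for illustration, and the Mamba layers are not modeled at all.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j], True when query position i may attend to key j.

    Causal sliding window: position i sees itself and the (window - 1)
    positions before it, i.e. j in [max(0, i - window + 1), i].
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    return [
        [max(0, i - window + 1) <= j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(seq_len: int, window: int) -> int:
    """Count (query, key) pairs allowed by a causal sliding window."""
    return sum(min(i + 1, window) for i in range(seq_len))


def full_causal_pairs(seq_len: int) -> int:
    """Count (query, key) pairs allowed by full causal attention."""
    return seq_len * (seq_len + 1) // 2


if __name__ == "__main__":
    # Small mask, printed with 'x' for allowed and '.' for masked positions.
    for row in sliding_window_mask(seq_len=6, window=3):
        print("".join("x" if allowed else "." for allowed in row))

    # Illustrative window size (an assumption, not taken from the text above).
    window = 2048
    for n in (4096, 8192, 16384):
        print(n, attended_pairs(n, window), full_causal_pairs(n))
```

For a fixed window, the windowed pair count grows in proportion to the sequence length once the sequence is longer than the window, whereas the full causal count grows quadratically. This is the sense in which a sliding-window layer stays linear-time.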