Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models

Xiaoxue Gao, Nancy F. Chen
2024-09-27
Abstract: Current automatic speech recognition systems struggle to model long speech sequences because of the quadratic complexity of Transformer-based models. Selective state space models such as Mamba have performed well on long-sequence modeling in natural language processing and computer vision tasks. However, their use in speech technology tasks has been under-explored. We propose Speech-Mamba, which incorporates selective state space modeling into Transformer neural architectures. Long-sequence representations from the selective state space models in Speech-Mamba are complemented by lower-level representations from Transformer-based modeling. Speech-Mamba achieves a better capacity to model long-range dependencies, as it scales near-linearly with sequence length.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to address the challenges that Automatic Speech Recognition (ASR) systems currently encounter when processing long speech sequences. Specifically, Transformer-based models perform poorly in modeling long speech sequences due to their quadratic complexity. The paper proposes a new method named Speech-Mamba, which combines Selective State Space Models (SSMs) with the Transformer architecture to capture long-range dependencies more effectively and improve the transcription accuracy of long speech sequences.

#### Main problems:

1. **Challenges in modeling long speech sequences**: Existing Transformer-based ASR systems face high computational complexity when processing long speech sequences, leading to a decline in model performance.
2. **Limitations of existing models**: Although the Transformer performs well on low-level speech and text representations, it has deficiencies in handling long-distance dependencies.
3. **Exploring the application of selective state space models**: Although selective state space models (such as Mamba) perform well in natural language processing and computer vision tasks, their application in speech technology tasks has not been fully explored.

### Solutions

To address the above problems, the paper proposes Speech-Mamba, which improves the modeling of long speech sequences in the following ways:

- **Combining selective state space models with Transformer**: Utilize the convolutional and near-linear computational capabilities of the Mamba model to enhance the capture of long-range dependencies.
- **Multi-objective learning**: Employ joint Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (S2S) losses for training to improve the generalization ability of the model on different types of data.
- **Experimental verification**: Through extensive experiments on the LibriSpeech dataset, the superior performance of Speech-Mamba in modeling long speech sequences is verified.

### Formula representation

The formulas involved in the paper include the objective function for multi-objective learning:

\[ L_{\text{Speech-Mamba}} = \alpha L_{\text{CTC}} + (1 - \alpha) L_{\text{S2S}} \]

where \(\alpha \in [0, 1]\), \(L_{\text{CTC}}\) is the CTC loss, and \(L_{\text{S2S}}\) is the S2S loss.

Through these improvements, Speech-Mamba can process long speech sequences more efficiently and exhibits superior performance over existing models on multiple test sets.
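As an illustration of the multi-objective formula, the following minimal Python sketch interpolates two scalar loss values with a weight \(\alpha\). The function name `joint_loss` and the numeric loss values are hypothetical placeholders chosen for this example; they are not taken from the paper, and the summary does not state which value of \(\alpha\) the authors use.

```python
def joint_loss(ctc_loss: float, s2s_loss: float, alpha: float) -> float:
    """Return alpha * L_CTC + (1 - alpha) * L_S2S for alpha in [0, 1].

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ctc_loss + (1.0 - alpha) * s2s_loss


if __name__ == "__main__":
    # Placeholder loss values for illustration only; not results from the paper.
    ctc, s2s = 2.0, 4.0
    for alpha in (0.0, 0.5, 1.0):
        print(f"alpha={alpha}: joint loss = {joint_loss(ctc, s2s, alpha)}")
```

Setting \(\alpha = 1\) reduces the objective to the CTC loss alone, and \(\alpha = 0\) reduces it to the S2S loss alone; intermediate values trade the two off. In an actual training setup the two inputs would typically be differentiable tensors produced by the model rather than plain floats.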