Abstract: Current automatic speech recognition systems struggle to model long speech sequences because of the quadratic complexity of Transformer-based models. Selective state space models such as Mamba have performed well on long-sequence modeling in natural language processing and computer vision tasks. However, their use in speech technology tasks remains under-explored. We propose Speech-Mamba, which incorporates selective state space modeling into Transformer neural architectures. Long-sequence representations from the selective state space models in Speech-Mamba are complemented with lower-level representations from Transformer-based modeling. Speech-Mamba achieves better capacity to model long-range dependencies, as it scales near-linearly with sequence length.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to address the challenges currently encountered by Automatic Speech Recognition (ASR) systems when processing long speech sequences. Specifically, Transformer-based models perform poorly in modeling long speech sequences due to their high quadratic complexity. The paper proposes a new method named Speech-Mamba, which combines Selective State Space Models (SSMs) with the Transformer architecture to more effectively capture long-range dependencies and improve the transcription accuracy of long speech sequences.

#### Main problems:

1. **Challenges in modeling long speech sequences**: Existing Transformer-based ASR systems face the problem of high computational complexity when processing long speech sequences, leading to a decline in model performance.
2. **Limitations of existing models**: Although Transformer performs well in low-level speech and text representations, it has deficiencies in handling long-distance dependencies.
3. **Exploring the application of selective state space models**: Although selective state space models (such as Mamba) perform well in natural language processing and computer vision tasks, their application in speech technology tasks has not been fully explored.

### Solutions

To address the above problems, the paper proposes Speech-Mamba, which improves the modeling ability of long speech sequences in the following ways:

- **Combining selective state space models with Transformer**: Utilize the powerful convolutional and near-linear computational capabilities of the Mamba model to enhance the capture of long-range dependencies.
- **Multi-objective learning**: Employ joint Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (S2S) losses for training to improve the generalization ability of the model on different types of data.
- **Experimental verification**: Through extensive experiments on the LibriSpeech dataset, the superior performance of Speech-Mamba in modeling long speech sequences is verified.

### Formula representation

The formulas involved in the paper include the objective function for multi-objective learning:

\[ L_{\text{Speech-Mamba}} = \alpha L_{\text{CTC}} + (1 - \alpha) L_{\text{S2S}} \]

where \(\alpha \in [0, 1]\), \(L_{\text{CTC}}\) is the CTC loss, and \(L_{\text{S2S}}\) is the S2S loss.

Through these improvements, Speech-Mamba can process long speech sequences more efficiently and exhibits superior performance over existing models on multiple test sets.
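
The weighted objective above can be illustrated with a minimal Python sketch. It only shows how \(\alpha\) trades off the two terms; the function name `joint_loss` and the scalar loss values are assumptions made for illustration and do not come from the paper. In a real training loop the two inputs would be the CTC and S2S losses computed by the model for a batch.

```python
def joint_loss(ctc_loss: float, s2s_loss: float, alpha: float) -> float:
    """Return the convex combination alpha * L_CTC + (1 - alpha) * L_S2S."""
    # The summary states that alpha lies in [0, 1]; reject anything else.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ctc_loss + (1.0 - alpha) * s2s_loss


# Hypothetical per-batch loss values, chosen only for illustration.
ctc, s2s = 2.0, 4.0
for alpha in (0.0, 0.3, 1.0):
    print(f"alpha={alpha:.1f} -> loss={joint_loss(ctc, s2s, alpha):.2f}")
```

With \(\alpha = 0\) the objective reduces to the S2S loss alone, and with \(\alpha = 1\) to the CTC loss alone; intermediate values blend the two. The summary does not state which value of \(\alpha\) the paper uses.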

Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models
