Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement

Wenze Ren, Haibin Wu, Yi-Cheng Lin, Xuanjun Chen, Rong Chao, Kuo-Hsuan Hung, You-Jin Li, Wen-Yuan Ting, Hsin-Min Wang, Yu Tsao

2024-09-16

Abstract: In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNN or LSTM, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches face limitations in fully modeling complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the current advanced model McNet by introducing an improved version of Mamba, a state-space model, and further propose MCMamba. MCMamba has been completely reengineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well in modeling spectral information.

Audio and Speech Processing, Sound

What problem does this paper attempt to address?

The paper addresses multichannel speech enhancement (SE): how to capture spatial and spectral information across microphones effectively in order to improve speech quality in noisy environments. Conventional sequence models such as convolutional neural networks (CNNs) and long short-term memory networks (LSTMs) are used to model the temporal dynamics of full-band and sub-band spectral and spatial features, but they struggle to capture complex temporal dependencies, especially in dynamic acoustic environments. To overcome this, the authors modify McNet, an existing advanced model, by introducing an improved version of Mamba, a state-space model, and call the result MCMamba. MCMamba is reengineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, giving a more comprehensive model of spatial and spectral information. In the reported experiments, MCMamba outperforms McNet and achieves state-of-the-art performance on the CHiME-3 dataset. The study also finds that Mamba performs exceptionally well at modeling spectral information.
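To make the full-band, narrow-band, and sub-band terminology concrete, the following is a minimal NumPy sketch of how a multichannel STFT tensor might be rearranged into sequences for each kind of module. It is an illustration under stated assumptions, not the authors' implementation: the function names, the real/imaginary feature stacking, the `neighbors` parameter, and the choice of sequence axis (frequency for full-band, time for narrow-band and sub-band) are all hypothetical. Any sequence model, such as a Mamba block, could then be applied along the middle axis of each output.

```python
import numpy as np


def full_band_sequences(stft):
    """Assumed full-band view: one sequence per time frame, running over frequency.

    stft: complex array of shape (channels C, freq bins F, frames T).
    Returns a real array of shape (T, F, 2C).
    """
    feats = np.concatenate([stft.real, stft.imag], axis=0)  # (2C, F, T)
    return feats.transpose(2, 1, 0)                         # (T, F, 2C)


def narrow_band_sequences(stft):
    """Assumed narrow-band view: one sequence per frequency bin, running over time.

    Returns a real array of shape (F, T, 2C).
    """
    feats = np.concatenate([stft.real, stft.imag], axis=0)  # (2C, F, T)
    return feats.transpose(1, 2, 0)                         # (F, T, 2C)


def sub_band_sequences(mag, neighbors=1):
    """Assumed sub-band view: each bin together with its neighboring bins.

    mag: real array of shape (F, T), e.g. the magnitude of a reference channel.
    Returns a real array of shape (F, T, 2 * neighbors + 1); the center
    feature (index `neighbors`) is the bin itself.
    """
    n_freq = mag.shape[0]
    # Reflect-pad the frequency axis so edge bins also have neighbors.
    padded = np.pad(mag, ((neighbors, neighbors), (0, 0)), mode="reflect")
    shifted = [padded[k:k + n_freq] for k in range(2 * neighbors + 1)]
    return np.stack(shifted, axis=-1)


# Toy input: 6 channels, 257 frequency bins, 10 frames (sizes are illustrative).
rng = np.random.default_rng(0)
stft = rng.standard_normal((6, 257, 10)) + 1j * rng.standard_normal((6, 257, 10))
mag = np.abs(stft[0])

full = full_band_sequences(stft)
narrow = narrow_band_sequences(stft)
sub = sub_band_sequences(mag, neighbors=1)
```

In this sketch the spatial views keep all channels as features so that inter-channel cues remain available, while the sub-band view uses a single reference channel's magnitude; both are design choices made here for illustration only.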

MC-SEMamba: A Simple Multi-channel Extension of SEMamba

Selective State Space Model for Monaural Speech Enhancement

An Investigation of Incorporating Mamba for Speech Enhancement

Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement

A Feature Integration Network for Multi-Channel Speech Enhancement

SepMamba: State-space models for speaker separation using Mamba

CMamba: Channel Correlation Enhanced State Space Models for Multivariate Time Series Forecasting

U-Mamba-Net: A highly efficient Mamba-based U-net style network for noisy and reverberant speech separation

C-Mamba: Channel Correlation Enhanced State Space Models for Multivariate Time Series Forecasting

Multi-scale Informative Perceptual Network for Monaural Speech Enhancement

CMMamba: Channel Mixing Mamba for Time Series Forecasting

ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR

Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers

An NMF-based MMSE Approach for Single Channel Speech Enhancement Using Densely Connected Convolutional Network

SPMamba: State-space model is all you need in speech separation

Monaural Speech Enhancement Using Deep Multi-Branch Residual Network with 1-D Causal Dilated Convolutions

A Lightweight and Real-Time Binaural Speech Enhancement Model with Spatial Cues Preservation