Abstract: This paper presents CleanUMamba, a time-domain neural network architecture designed for real-time causal audio denoising directly applied to raw waveforms. CleanUMamba leverages a U-Net encoder-decoder structure, incorporating the Mamba state-space model in the bottleneck layer. By replacing conventional self-attention and LSTM mechanisms with Mamba, our architecture offers superior denoising performance while maintaining a constant memory footprint, enabling streaming operation. To enhance efficiency, we applied structured channel pruning, achieving an 8x reduction in model size without compromising audio quality. Our model demonstrates strong results in the Interspeech 2020 Deep Noise Suppression challenge. Specifically, CleanUMamba achieves a PESQ score of 2.42 and STOI of 95.1% with only 442K parameters and 468M MACs, matching or outperforming larger models in real-time performance. Code will be available at: https://github.com/lab-emi/CleanUMamba

What problem does this paper attempt to address?

The paper addresses real-time causal audio denoising (speech denoising), specifically denoising applied directly to the raw waveform in the time domain. The authors propose a new architecture, CleanUMamba, that aims to improve on existing methods in three respects:

1. **Improve denoising performance**: Replace conventional self-attention and LSTM mechanisms with the Mamba state-space model for more efficient sequence modeling.
2. **Maintain a constant memory footprint**: Allow the model to run in a streaming setting with low latency.
3. **Reduce model size**: Use structured channel pruning to cut the number of parameters substantially without degrading audio quality.

### Specific problem description

The goal of audio denoising is to recover the clean speech signal \(x\) from a noisy signal \(y = x + v\), where \(v\) is zero-mean background noise uncorrelated with \(x\). For real-time applications, the model must produce an estimate \(\hat{x}_t \approx x_t\) at each time \(t\) from the noisy samples \(y_{1:t}\) observed up to that time. In practice, a small look-ahead delay is tolerated (for example, 5 to 6 milliseconds for hearing aids, or up to 200 milliseconds for video calls).

### Innovations of CleanUMamba

- **U-Net encoder-decoder structure**: With the Mamba state-space model placed in the bottleneck layer, sequence modeling runs in a lower-resolution latent space, which reduces the computational load.
- **Application of the Mamba model**: Mamba supports parallel computation during training and recurrent processing during inference, and it is not limited by sequence length, which suits audio denoising.
- **Structured channel pruning**: Pruning guided by a periodically calibrated GroupTaylor importance measure achieves an 8-fold reduction in model size while maintaining high audio quality.

### Experimental results

The authors evaluated CleanUMamba on the Interspeech 2020 Deep Noise Suppression Challenge and compared it with other models. CleanUMamba performs strongly on metrics such as PESQ and STOI: with 442K parameters and 468M MACs, it reaches a PESQ score of 2.42 and an STOI of 95.1%.

In summary, the paper's main contribution is a new Mamba-based architecture for real-time audio denoising, combined with structured pruning that substantially reduces model size and improves computational efficiency while maintaining denoising quality.
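The claim that a state-space layer can be computed over a whole sequence during training yet run recurrently with a fixed-size state at inference can be illustrated with a small sketch. This is not the CleanUMamba or Mamba implementation: it assumes a plain time-invariant diagonal state-space model (Mamba's parameters are input-dependent and it uses a parallel scan instead of a convolution), and the names `ssm_kernel`, `run_convolution`, and `StreamingSSM`, along with the coefficient values, are hypothetical.

```python
import random


def ssm_kernel(a, b, c, length):
    """Impulse response k[n] = sum_i c_i * a_i**n * b_i of a diagonal linear SSM."""
    return [
        sum(ci * (ai ** n) * bi for ai, bi, ci in zip(a, b, c))
        for n in range(length)
    ]


def run_convolution(signal, a, b, c):
    """Whole-sequence form: causal convolution with the impulse response.

    Needs the full input history, so memory grows with sequence length.
    """
    k = ssm_kernel(a, b, c, len(signal))
    return [
        sum(k[n] * signal[t - n] for n in range(t + 1))
        for t in range(len(signal))
    ]


class StreamingSSM:
    """Recurrent form: h_t = a * h_{t-1} + b * u_t, y_t = c . h_t.

    Only the hidden state is kept between chunks, so the memory
    footprint is constant regardless of how much audio has been seen.
    """

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c
        self.state = [0.0] * len(a)

    def process(self, chunk):
        out = []
        for u in chunk:
            self.state = [
                ai * hi + bi * u
                for ai, bi, hi in zip(self.a, self.b, self.state)
            ]
            out.append(sum(ci * hi for ci, hi in zip(self.c, self.state)))
        return out


# Hypothetical coefficients and a random stand-in for a noisy waveform.
random.seed(0)
a = [0.9, 0.5, -0.3]
b = [1.0, 0.5, 0.25]
c = [0.2, -0.4, 0.7]
noisy = [random.uniform(-1.0, 1.0) for _ in range(64)]

offline = run_convolution(noisy, a, b, c)

model = StreamingSSM(a, b, c)
streamed = []
for start in range(0, len(noisy), 16):  # feed 16-sample chunks
    streamed.extend(model.process(noisy[start:start + 16]))

max_diff = max(abs(p - q) for p, q in zip(offline, streamed))
print("max difference between offline and streamed outputs:", max_diff)
```

Both forms compute the same causal output, but the streaming version carries only `len(a)` state values between chunks. That property is what the summary refers to as a constant memory footprint.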

CleanUMamba: A Compact Mamba Network for Speech Denoising using Channel Pruning

Multinoise-type Blind Denoising Using a Single Uniform Deep Convolutional Neural Network

Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement

SSUMamba: Spatial-Spectral Selective State Space Model for Hyperspectral Image Denoising

Speech Enhancement Using U-Net with Compressed Sensing

CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram

SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression

Efficient Seismic Data Denoising via Deep Learning With Improved MCA-SCUNet

SepMamba: State-space models for speaker separation using Mamba

A Wavenet for Speech Denoising

Align-ULCNet: Towards Low-Complexity and Robust Acoustic Echo and Noise Reduction

Selective State Space Model for Monaural Speech Enhancement

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

DENet: a deep architecture for audio surveillance applications

MSEMG: Surface Electromyography Denoising with a Mamba-based Efficient Network

A Hybrid Approach for Low-Complexity Joint Acoustic Echo and Noise Reduction

CT-Mamba: A Hybrid Convolutional State Space Model for Low-Dose CT Denoising

In-Vehicle Environment Noise Speech Enhancement Using Lightweight Wave-U-Net

DenoMamba: A fused state-space model for low-dose CT denoising

Ultra Low Complexity Deep Learning Based Noise Suppression

Towards speech enhancement using a variational U-Net architecture