CleanUMamba: A Compact Mamba Network for Speech Denoising using Channel Pruning

Sjoerd Groot, Qinyu Chen, Jan C. van Gemert, Chang Gao
2024-10-15
Abstract: This paper presents CleanUMamba, a time-domain neural network architecture designed for real-time causal audio denoising directly applied to raw waveforms. CleanUMamba leverages a U-Net encoder-decoder structure, incorporating the Mamba state-space model in the bottleneck layer. By replacing conventional self-attention and LSTM mechanisms with Mamba, our architecture offers superior denoising performance while maintaining a constant memory footprint, enabling streaming operation. To enhance efficiency, we applied structured channel pruning, achieving an 8X reduction in model size without compromising audio quality. Our model demonstrates strong results in the Interspeech 2020 Deep Noise Suppression challenge. Specifically, CleanUMamba achieves a PESQ score of 2.42 and STOI of 95.1% with only 442K parameters and 468M MACs, matching or outperforming larger models in real-time performance. Code will be available at: https://github.com/lab-emi/CleanUMamba
Sound, Artificial Intelligence, Computer Vision and Pattern Recognition, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper addresses is real-time causal audio denoising (speech denoising), specifically denoising applied directly to the raw waveform in the time domain. The authors propose a new architecture, CleanUMamba, that aims to improve on existing methods in three respects:

1. **Improve denoising performance**: Replace the traditional self-attention mechanism and LSTM with the Mamba state-space model to achieve more efficient sequence modeling.
2. **Maintain a constant memory footprint**: Ensure that the model can run in a streaming setting while keeping latency low.
3. **Reduce model size**: Significantly reduce the number of model parameters through structured channel pruning, without degrading audio quality.

### Specific problem description

The goal of audio denoising is to recover the clean speech signal \(x\) from a noisy signal \(y = x + v\), where \(v\) is zero-mean background noise uncorrelated with \(x\). For real-time applications, the model must reconstruct the clean speech \(\hat{x}_t\approx x_t\) at a given time \(t\) using only the noisy samples \(y_{1:t}\) observed up to that time. In practice, a small look-ahead delay is allowed (for example, 5-6 milliseconds for hearing aids, or up to 200 milliseconds for video calls).

### Innovations of CleanUMamba

- **U-Net encoder-decoder structure**: With the Mamba state-space model placed in the bottleneck layer, sequence modeling happens in a lower-resolution latent space, which reduces the computational load.
- **Application of the Mamba model**: Mamba supports parallel computation during training and recurrent processing during inference with a fixed-size state, so its memory use does not grow with sequence length. This makes it well suited to streaming audio denoising.
- **Structured channel pruning**: By periodically recalibrating the GroupTaylor importance measure during pruning, the model size is reduced 8-fold while high audio quality is maintained.

### Experimental results

The authors evaluated CleanUMamba on the Interspeech 2020 Deep Noise Suppression Challenge and compared it with other models. CleanUMamba achieves strong results on metrics such as PESQ and STOI. In particular, with 442K parameters and 468M MACs, it reaches a PESQ score of 2.42 and an STOI of 95.1%.

In summary, the main contribution of this paper is a new Mamba-based architecture for real-time audio denoising, together with structured pruning that significantly reduces the model size and thereby the computational cost.
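The constant-memory streaming behavior attributed to Mamba above can be illustrated with a toy example. The sketch below is not Mamba itself: it assumes a fixed-parameter, diagonal linear state-space model (the names `a`, `b`, `c`, `ssm_step`, `run_stream`, and `run_full` are hypothetical), whereas Mamba makes its parameters input-dependent and trains with a parallel scan. It only shows that a recurrent update with a fixed-size state reproduces the output of processing the whole sequence at once, regardless of how the input is chunked.

```python
from typing import List, Sequence, Tuple


def ssm_step(state: List[float], u: float,
             a: Sequence[float], b: Sequence[float],
             c: Sequence[float]) -> Tuple[List[float], float]:
    """One recurrent update: h <- a * h + b * u, then y = c . h."""
    new_state = [ai * hi + bi * u for ai, hi, bi in zip(a, state, b)]
    y = sum(ci * hi for ci, hi in zip(c, new_state))
    return new_state, y


def run_stream(chunks: Sequence[Sequence[float]],
               a: Sequence[float], b: Sequence[float],
               c: Sequence[float]) -> List[float]:
    """Process audio chunk by chunk; only `state` is carried between chunks."""
    state = [0.0] * len(a)  # fixed size, independent of stream length
    out: List[float] = []
    for chunk in chunks:
        for u in chunk:
            state, y = ssm_step(state, u, a, b, c)
            out.append(y)
    return out


def run_full(signal: Sequence[float],
             a: Sequence[float], b: Sequence[float],
             c: Sequence[float]) -> List[float]:
    """Reference: the same model written as a causal convolution over the
    whole sequence, y_t = sum_k (sum_i c_i * a_i**k * b_i) * u_{t-k}."""
    out: List[float] = []
    for t in range(len(signal)):
        y = 0.0
        for k in range(t + 1):
            kernel_k = sum(ci * (ai ** k) * bi for ai, bi, ci in zip(a, b, c))
            y += kernel_k * signal[t - k]
        out.append(y)
    return out
```

Because each output depends only on the current state and the current sample, the streaming path is causal by construction, and its memory cost is the size of `state` plus one chunk.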
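The structured channel pruning step can be sketched in a similar spirit. The summary does not give the exact GroupTaylor formula, so the code below assumes a common first-order Taylor-style score: the sum of |weight × gradient| over all parameters belonging to a channel. The function names and the `grad_fn` callback are hypothetical; in a real setup `grad_fn` would run calibration batches through the current network and return loss gradients. Recomputing the scores in every round mirrors the periodic recalibration mentioned above.

```python
from typing import Callable, List

Channels = List[List[float]]  # one list of parameters per channel


def channel_importance(weights: Channels, grads: Channels) -> List[float]:
    """Assumed Taylor-style score: sum of |w * g| over a channel's parameters."""
    return [sum(abs(w * g) for w, g in zip(ws, gs))
            for ws, gs in zip(weights, grads)]


def iterative_prune(weights: Channels,
                    grad_fn: Callable[[Channels], Channels],
                    target_channels: int,
                    per_step: int = 1) -> List[int]:
    """Drop the least important channels a few at a time and return the
    indices (into the original `weights`) of the channels that survive."""
    kept = list(range(len(weights)))
    while len(kept) > target_channels:
        current = [weights[ch] for ch in kept]
        grads = grad_fn(current)  # recalibrate on the current, smaller model
        scores = channel_importance(current, grads)
        n_drop = min(per_step, len(kept) - target_channels)
        # Positions (within `kept`) of the lowest-scoring channels.
        lowest = sorted(range(len(kept)), key=lambda i: scores[i])[:n_drop]
        drop = set(lowest)
        kept = [ch for i, ch in enumerate(kept) if i not in drop]
    return kept
```

Removing whole channels, rather than individual weights, shrinks the actual tensor shapes, which is why structured pruning reduces both the parameter count and the MACs on ordinary hardware.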