Scalable Speech Enhancement with Dynamic Channel Pruning

Riccardo Miccini, Clement Laroche, Tobias Piechowiak, Luca Pezzarossa
2024-12-23
Abstract: Speech Enhancement (SE) is essential for improving productivity in remote collaborative environments. Although deep learning models are highly effective at SE, their computational demands make them impractical for embedded systems. Furthermore, acoustic conditions can change significantly in terms of difficulty, whereas neural networks are usually static with regard to the amount of computation performed. To this end, we introduce Dynamic Channel Pruning (DynCP) to the audio domain for the first time and apply it to a custom convolutional architecture for SE. Our approach works by identifying unnecessary convolutional channels at runtime and saving computational resources by neither computing the activations for these channels nor retrieving their filters. When trained to use only 25% of channels, the model saves 29.6% of MACs while incurring only a 0.75% drop in PESQ. Thus, DynCP offers a promising path toward deploying larger and more powerful SE solutions on resource-constrained devices.
Audio and Speech Processing, Machine Learning, Sound
What problem does this paper attempt to address?
The paper addresses how to balance computational efficiency and model performance when deploying a Speech Enhancement (SE) model on resource-constrained embedded devices. Although deep-learning models perform well at speech enhancement, their computational requirements make them difficult to run on embedded systems. Moreover, real-world acoustic conditions vary widely, whereas a conventional neural network performs a fixed amount of computation regardless of how difficult the input is. To address both issues, the authors introduce Dynamic Channel Pruning (DynCP) and apply it to a custom convolutional architecture. DynCP saves computational resources by identifying and skipping unnecessary convolutional channels at runtime, so the model adapts its amount of computation to the complexity of the input. This makes deployment on resource-constrained devices more feasible while preserving most of the speech-enhancement quality.

### Main contributions:
1. **Proposed a fully convolutional architecture**: Based on depthwise-separable dilated convolutions.
2. **Integrated a lightweight gating module**: Jointly trained with the backbone network to determine which channels can be skipped.
3. **Evaluated the dynamic architecture on popular datasets**: For speech-enhancement and noise-reduction tasks.
4. **Analyzed the influence of different hyper-parameters and training strategies on model performance**.

### Specific problem description:
- **Computational resource limitations**: Embedded devices (such as earphones and speakers) have strict limits on energy, memory, and computing power, making it difficult to deploy state-of-the-art deep-learning SE solutions.
- **Variations in acoustic conditions**: Real-world acoustic environments are complex and variable, and a static neural network cannot adapt to these changes, so it may underperform on difficult inputs or waste resources on easy ones.

### Solution:
With DynCP, the model adjusts its amount of computation during inference according to the difficulty of the input, reducing computational cost while largely maintaining performance. Experimental results show that when the model is trained to use only 25% of the channels, MACs are reduced by 29.6% while the PESQ score drops by only 0.75%.

### Summary:
This research provides a feasible path for deploying more powerful and efficient speech-enhancement solutions on resource-constrained devices, and demonstrates the potential of dynamic neural networks in audio processing.
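
The channel-skipping idea described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the gate design (time-pooling followed by one linear layer and a top-k selection), the function names `gate` and `pointwise_conv`, and the toy layer sizes are all assumptions made for illustration. The sketch applies a gate to a single 1x1 convolution, computes only the selected output channels, and counts the multiply-accumulates (MACs) actually performed.

```python
import random


def gate(x, gate_weights, keep_ratio):
    """Hypothetical gate: pool each input channel over time, score every
    output channel with one linear layer, and keep the top-k scores."""
    pooled = [sum(abs(v) for v in channel) / len(channel) for channel in x]
    saliency = [sum(w * p for w, p in zip(row, pooled)) for row in gate_weights]
    k = max(1, round(keep_ratio * len(saliency)))
    ranked = sorted(range(len(saliency)), key=lambda o: saliency[o], reverse=True)
    return sorted(ranked[:k])


def pointwise_conv(x, weights, active=None):
    """1x1 convolution on a [channels][frames] input.
    Only output channels listed in `active` are computed; the rest stay zero.
    Returns the output and the number of multiply-accumulates performed."""
    c_in, n_frames = len(x), len(x[0])
    channels = range(len(weights)) if active is None else active
    y = [[0.0] * n_frames for _ in weights]
    macs = 0
    for o in channels:
        for t in range(n_frames):
            y[o][t] = sum(weights[o][i] * x[i][t] for i in range(c_in))
        macs += c_in * n_frames
    return y, macs


# Toy sizes, chosen for illustration only (not taken from the paper).
random.seed(0)
C_IN, C_OUT, T = 8, 16, 10
x = [[random.gauss(0, 1) for _ in range(T)] for _ in range(C_IN)]
W = [[random.gauss(0, 1) for _ in range(C_IN)] for _ in range(C_OUT)]
G = [[random.gauss(0, 1) for _ in range(C_IN)] for _ in range(C_OUT)]

dense_y, dense_macs = pointwise_conv(x, W)
active = gate(x, G, keep_ratio=0.25)
sparse_y, sparse_macs = pointwise_conv(x, W, active)

gate_macs = C_IN * C_OUT  # cost of the gate's linear layer in this toy setup
saving = 1 - (sparse_macs + gate_macs) / dense_macs
print(f"active channels: {active}")
print(f"MACs: dense={dense_macs}, pruned={sparse_macs}, gate={gate_macs}")
print(f"net saving for this layer: {saving:.0%}")
```

With these toy sizes, keeping 4 of 16 channels cuts the convolution's MACs by 75%, but the net saving for the layer is about 65% once the gate's own cost is counted. This is one reason a saving can fall short of the pruned fraction. The sketch does not reproduce the paper's figures, which depend on the full architecture and on which layers are gated.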