Abstract: Speech Enhancement (SE) is essential for improving productivity in remote collaborative environments. Although deep learning models are highly effective at SE, their computational demands make them impractical for embedded systems. Furthermore, acoustic conditions can change significantly in terms of difficulty, whereas neural networks are usually static with regard to the amount of computation performed. To this end, we introduce Dynamic Channel Pruning (DynCP) to the audio domain for the first time and apply it to a custom convolutional architecture for SE. Our approach works by identifying unnecessary convolutional channels at runtime and saving computational resources by neither computing the activations for these channels nor retrieving their filters. When trained to only use 25% of channels, we save 29.6% of MACs while only causing a 0.75% drop in PESQ. Thus, DynCP offers a promising path toward deploying larger and more powerful SE solutions on resource-constrained devices.
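
The runtime channel selection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the gate design (a linear scoring of per-channel input magnitudes followed by top-k selection) and the names `gate_top_k`, `pruned_pointwise_conv`, `w_gate`, and `keep_ratio` are assumptions made for illustration only.

```python
import numpy as np

def gate_top_k(x, w_gate, keep_ratio):
    """Choose which output channels to compute from a cheap summary of the input.

    x: (C_in, T) input features; w_gate: (C_out, C_in) hypothetical gate weights.
    """
    summary = np.abs(x).mean(axis=1)            # (C_in,) mean magnitude per input channel
    scores = w_gate @ summary                   # (C_out,) one score per output channel
    k = max(1, int(round(keep_ratio * scores.shape[0])))
    return np.sort(np.argsort(scores)[-k:])     # indices of the k highest-scoring channels

def pruned_pointwise_conv(x, w, active):
    """1x1 convolution that computes only the active output channels."""
    y = np.zeros((w.shape[0], x.shape[1]), dtype=x.dtype)
    y[active] = w[active] @ x                   # skipped filters are never read or applied
    return y

# Toy example with random data (sizes are arbitrary).
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 100))              # 16 input channels, 100 frames
w = rng.standard_normal((32, 16))               # 32 pointwise filters
w_gate = rng.standard_normal((32, 16))          # gate weights (would be learned jointly)
active = gate_top_k(x, w_gate, keep_ratio=0.25)
y = pruned_pointwise_conv(x, w, active)
```

Because the selection depends on `x`, a different input can activate a different subset of channels, which is what makes the pruning dynamic rather than fixed at training time.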

What problem does this paper attempt to address?

The main problem this paper addresses is how to balance computational efficiency and model performance when deploying a Speech Enhancement (SE) model on resource-constrained embedded devices. Although deep-learning models perform well at speech enhancement, their computational requirements make them hard to use in embedded systems. Moreover, real-world acoustic conditions vary widely in difficulty, whereas a conventional neural network performs a fixed amount of computation regardless of how hard the input is. To address this, the authors introduce Dynamic Channel Pruning (DynCP) and apply it to a custom convolutional architecture. DynCP saves computation by identifying and skipping unnecessary convolutional channels at runtime, so the model adapts its amount of computation to the complexity of the input. This makes deployment on resource-constrained devices more feasible while largely preserving enhancement quality.

### Main contributions:

1. **Proposed a fully convolutional architecture**: Based on depthwise-separable dilated convolutions.
2. **Integrated a lightweight gating module**: Jointly trained with the backbone network to determine which channels can be skipped.
3. **Evaluated the dynamic architecture on popular datasets**: For speech-enhancement and noise-reduction tasks.
4. **Analyzed the influence of different hyperparameters and training strategies on model performance**.

### Specific problem description:

- **Computational resource limitations**: Embedded devices (such as earphones and speakers) have strict limits on energy, memory, and computing power, making it difficult to deploy state-of-the-art deep-learning SE solutions.
- **Variations in acoustic conditions**: Real-world acoustic environments are complex and variable, and a static neural network cannot adapt to these changes, so it may either underperform on hard inputs or waste resources on easy ones.

### Solution:

With DynCP, the model adjusts its amount of computation during inference according to the difficulty of the input, reducing computational cost while keeping enhancement quality close to that of the full model. Experimental results show that when the model is trained to use only 25% of the channels, MACs are reduced by 29.6% while the PESQ score drops by only 0.75%.

### Summary:

This research provides a feasible path for deploying more powerful and efficient speech-enhancement solutions on resource-constrained devices, demonstrating the potential of dynamic neural networks in audio processing.
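
The reported figures (25% of channels, 29.6% fewer MACs) show that the MAC saving is smaller than the channel reduction. The source does not break down where the MACs go, so the sketch below only illustrates one plausible reason: layers that are never pruned and the gating module itself add a fixed cost. All sizes and the constants `FIXED_MACS` and `GATE_MACS` are hypothetical, and the result is not meant to reproduce the paper's numbers.

```python
def separable_conv_macs(channels_in, channels_out, kernel_size, frames, keep_ratio=1.0):
    """MACs of a 1-D depthwise conv followed by a pointwise (1x1) conv.

    Assumption: pruning removes depthwise channels, so the pointwise conv
    also has fewer inputs to sum over.
    """
    active = max(1, round(keep_ratio * channels_in))
    depthwise = active * kernel_size * frames     # one K-tap filter per active channel
    pointwise = active * channels_out * frames    # 1x1 mix over active channels only
    return depthwise + pointwise

# Hypothetical sizes, not taken from the paper.
C_IN, C_OUT, K, T = 64, 64, 3, 100
FIXED_MACS = 300_000   # layers that are never pruned (assumed)
GATE_MACS = 10_000     # cost of running the gating module (assumed)

dense_layer = separable_conv_macs(C_IN, C_OUT, K, T)
pruned_layer = separable_conv_macs(C_IN, C_OUT, K, T, keep_ratio=0.25)

layer_saving = 1 - pruned_layer / dense_layer
dense_total = FIXED_MACS + dense_layer
pruned_total = FIXED_MACS + GATE_MACS + pruned_layer
overall_saving = 1 - pruned_total / dense_total

print(f"layer-only saving: {layer_saving:.1%}")
print(f"whole-model saving: {overall_saving:.1%}")
```

Under these assumptions the prunable layer alone saves 75% of its MACs, while the whole-model saving is smaller because the fixed and gating costs are paid regardless of how many channels are skipped.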

Scalable Speech Enhancement with Dynamic Channel Pruning

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Compact Deep Neural Networks for Real-Time Speech Enhancement on Resource-Limited Devices

LiSenNet: Lightweight Sub-band and Dual-Path Modeling for Real-Time Speech Enhancement

Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition

SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech

Towards efficient models for real-time deep noise suppression

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

Modulating State Space Model with SlowFast Framework for Compute-Efficient Ultra Low-Latency Speech Enhancement

CheapNET: Improving Light-weight speech enhancement network by projected loss function

A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2

Speech enhancement deep-learning architecture for efficient edge processing

Restorative Speech Enhancement: A Progressive Approach Using SE and Codec Modules

DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement

Personalized Speech Enhancement Without a Separate Speaker Embedding Model

Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization

Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition

Lite-RTSE: Exploring a Cost-Effective Lite DNN Model for Real-Time Speech Enhancement in RTC Scenarios

Efficient Encoder-Decoder and Dual-Path Conformer for Comprehensive Feature Learning in Speech Enhancement

Efficient High-Performance Bark-Scale Neural Network for Residual Echo and Noise Suppression