Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds

Hanbin Bae, Pavel Andreev, Azat Saginbaev, Nicholas Babaev, Won-Jun Lee, Hosang Sung, Hoon-Young Cho
DOI: https://doi.org/10.21437/Interspeech.2024-1444
2024-09-27
Abstract: This paper introduces a speech enhancement solution tailored for true wireless stereo (TWS) earbuds on-device usage. The solution was specifically designed to support conversations in noisy environments, with active noise cancellation (ANC) activated. The primary challenges for speech enhancement models in this context arise from computational complexity that limits on-device usage and latency that must be less than 3 ms to preserve a live conversation. To address these issues, we evaluated several crucial design elements, including the network architecture and domain, design of loss functions, pruning method, and hardware-specific optimization. Consequently, we demonstrated substantial improvements in speech enhancement quality compared with that in baseline models, while simultaneously reducing the computational complexity and algorithmic latency.
Audio and Speech Processing, Artificial Intelligence, Sound, Signal Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses speech enhancement for True Wireless Stereo (TWS) earbuds in noisy environments, particularly when Active Noise Cancellation (ANC) is activated. The goal is to improve speech enhancement quality while keeping latency below 3 milliseconds and reducing computational complexity enough for real-time, on-device use.

### Main Challenges and Solutions

1. **Low Latency Requirement**: To keep a live conversation natural, the algorithm's latency must stay below 3 milliseconds.
2. **Computational Resource Constraints**: The model runs on the device itself, so it must use the available computational resources efficiently.

### Research Methods and Results

1. **Network Architecture Selection**: A comparison of frequency-domain and time-domain network architectures showed that the time-domain baseline model is more effective when both are given similar computational resources and algorithmic latency.
2. **Loss Function Design**: The effectiveness of adversarial loss was evaluated, and a two-stage training method was proposed that combines Phone-Fortified Perceptual Loss (PFPL), adversarial loss, UTokyo-sarulab Mean Opinion Score (UTMOS), and Perceptual Evaluation of Speech Quality (PESQ).
3. **Pruning Methods**: Traditional magnitude pruning was compared with the combination of Sparsity Profiles via Dynamic programming search (SPDY) and Optimal Brain Compression (OBC). The SPDY+OBC method significantly improved the quality of the pruned model.
4. **Hardware Optimization**: The optimized model was simulated and tested on the Cadence Tensilica HiFi4 DSP, ultimately achieving 291 million cycles per second (MCPS) and a size of approximately 800 kB.
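To make the pruning comparison more concrete, here is a minimal sketch of the traditional magnitude pruning baseline only, not of SPDY+OBC. It assumes unstructured pruning over a flat list of weights; the function name `magnitude_prune` and the example values are hypothetical and do not come from the paper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights.

    `weights` is a flat list of floats and `sparsity` is the fraction
    of entries to remove (0.0 to 1.0). Both names are illustrative.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(sparsity * len(weights))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    # Keep surviving weights unchanged; replace pruned ones with zero.
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


example = magnitude_prune([0.1, -2.0, 0.05, 3.0, -0.2, 1.0], 0.5)
```

Magnitude pruning ranks weights purely by absolute value and ignores how each weight affects the layer output, which is the kind of limitation that search-based and reconstruction-based methods such as SPDY+OBC are meant to address.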
Using these methods, the authors developed a low-latency speech enhancement model suitable for on-device use. It outperforms the baseline models in speech enhancement quality while reducing computational complexity and algorithmic latency.
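As a rough illustration of what the two headline numbers mean in practice, the sketch below converts the 3 ms latency limit into a sample count and the 291 MCPS figure into a per-frame cycle count. The sample rates (16 kHz and 48 kHz) and the 1 ms frame length are assumptions chosen for the example; this summary does not state the values used in the paper.

```python
def samples_within(latency_ms, sample_rate_hz):
    """Whole samples that fit inside a latency budget given in milliseconds."""
    return int(latency_ms * sample_rate_hz // 1000)


def cycles_per_frame(mcps, frame_ms):
    """Cycles spent per frame by a workload running at `mcps` million cycles per second."""
    return mcps * 1_000_000 * frame_ms / 1000


# Assumed sample rates (not stated in the source).
budget_16k = samples_within(3, 16_000)
budget_48k = samples_within(3, 48_000)

# Assumed 1 ms frame (not stated in the source).
cycles_1ms = cycles_per_frame(291, 1)
```

Under these assumptions, a 3 ms budget leaves only a few dozen samples at 16 kHz for all buffering and lookahead. This helps explain why the architecture, the pruning method, and the hardware-specific optimization all had to be considered together.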