Abstract: This paper introduces a speech enhancement solution tailored for on-device use in true wireless stereo (TWS) earbuds. The solution was specifically designed to support conversations in noisy environments with active noise cancellation (ANC) activated. The primary challenges for speech enhancement models in this context arise from computational complexity, which limits on-device usage, and from latency, which must be less than 3 ms to preserve a live conversation. To address these issues, we evaluated several crucial design elements, including the network architecture and domain, the design of loss functions, the pruning method, and hardware-specific optimization. Consequently, we demonstrated substantial improvements in speech enhancement quality compared with baseline models, while simultaneously reducing the computational complexity and algorithmic latency.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses speech enhancement for True Wireless Stereo (TWS) earbuds in noisy environments, particularly when Active Noise Cancellation (ANC) is activated. Specifically, the goal is to improve speech enhancement quality while keeping latency below 3 milliseconds and reducing computational complexity enough for real-time, on-device use.

### Main Challenges and Solutions

1. **Low Latency Requirement**: To keep a live conversation natural, the algorithm's latency must stay below 3 milliseconds.
2. **Computational Resource Constraints**: The model must use computational resources efficiently, because it runs on the device itself.

### Research Methods and Results

1. **Network Architecture Selection**: A comparison between frequency-domain and time-domain network architectures showed that the time-domain baseline model is more effective given similar computational resources and algorithmic latency.
2. **Loss Function Design**: The efficiency of adversarial loss was evaluated, and a two-stage training method combining Phone-Fortified Perceptual Loss (PFPL), adversarial loss, UTokyo-sarulab Mean Opinion Score (UTMOS), and Perceptual Evaluation of Speech Quality (PESQ) was proposed.
3. **Pruning Methods**: Traditional magnitude pruning was compared with the Sparsity Profiles via Dynamic programming search (SPDY) + Optimal Brain Compression (OBC) method; SPDY+OBC significantly improved the quality of the pruned model.
4. **Hardware Optimization**: The optimized model was simulated and tested on the Cadence Tensilica HiFi4 DSP, ultimately achieving 291 million cycles per second (MCPS) and a size of approximately 800 kB.

Through these methods, the research team developed a low-latency speech enhancement model suitable for on-device use that outperforms the baseline model in speech enhancement quality while meeting the latency requirement.
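
The magnitude-pruning baseline mentioned in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `magnitude_prune`, the example weights, and the 50% sparsity level are all hypothetical, chosen only to show the general idea of zeroing the smallest-magnitude weights. It does not attempt to reproduce SPDY+OBC.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes.

    Illustrative helper only (hypothetical name, not taken from the paper).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the sort is stable, so ties
    # are pruned in their original order.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


# Hypothetical layer weights, assumed only for this example.
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Magnitude pruning of this kind ranks weights by absolute value alone. According to the summary, SPDY+OBC gave better pruned-model quality than this baseline.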

Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds

ClearBuds: Wireless Binaural Earbuds for Learning-Based Speech Enhancement

A Lightweight and Real-Time Binaural Speech Enhancement Model with Spatial Cues Preservation

EarSpeech: Exploring In-Ear Occlusion Effect on Earphones for Data-efficient Airborne Speech Enhancement

Deep Neural Network Based Noised Asian Speech Enhancement and Its Implementation on a Hearing Aid App.

Towards sub-millisecond latency real-time speech enhancement models on hearables

Ultra-Low Latency Speech Enhancement - A Comprehensive Study

A Supervised Speech Enhancement Method for Smartphone-Based Binaural Hearing Aids

A Smart Binaural Hearing Aid Architecture Leveraging a Smartphone APP with Deep-Learning Speech Enhancement.

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System

Enabling Real-Time On-Chip Audio Super Resolution for Bone-Conduction Microphones

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement.

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Efficient High-Performance Bark-Scale Neural Network for Residual Echo and Noise Suppression

Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones

BAE-Net: A Low complexity and high fidelity Bandwidth-Adaptive neural network for speech super-resolution

Speech enhancement deep-learning architecture for efficient edge processing

Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet

A Speech Enhancement Method Combining Beamforming with RNN for Hearing Aids.

Exploring Speech Enhancement for Low-resource Speech Synthesis

In-Ear-Voice: Towards Milli-Watt Audio Enhancement With Bone-Conduction Microphones for In-Ear Sensing Platforms