A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2

Thomas Serre, Mathieu Fontaine, Éric Benhaim, Geoffroy Dutour, Slim Essid
2024-04-11
Abstract: Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, a SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performances of DeepFilterNet2 while preserving minimal computational overhead.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses Personalized Speech Enhancement (PSE): extracting a target speaker's voice from recordings that contain multiple speakers and background noise. Deep neural network (DNN) frameworks perform well at speech enhancement, but they often demand substantial computational resources, which makes them unsuitable for resource-constrained embedded devices. The researchers introduce an approach that personalizes DeepFilterNet2, a lightweight two-stage speech enhancement model, by integrating speaker information. PSE relies on previously obtained information about the target speaker's voice. Here, a speaker encoder based on ECAPA-TDNN produces a speaker embedding, which is fed to the DeepFilterNet2 enhancement model as a cue for identifying the target speech. The paper examines different positions for integrating the speaker embeddings in the two-stage enhancement architecture, as well as a training strategy tailored to the PSE task.
The resulting personalized model, pDeepFilterNet2, integrates speaker information through one of two encoder structures: a unified encoder or a dual encoder. Experimental results show that the personalized model significantly outperforms the non-personalized DeepFilterNet2, especially in the presence of interfering speakers. The unified encoder structure performs best, indicating that combining speaker embeddings with features makes more effective use of the embedding information. Furthermore, although the other personalized variants add some computational complexity, their overall complexity remains relatively low, making them suitable for real-time applications on embedded devices.
Results on the blind test set show that, although pDeepFilterNet2 does not match the larger TEA-PSE 3.0 model in certain cases, it is considerably more lightweight and can perform real-time PSE on low-resource devices. In summary, the paper aims to adapt a lightweight speech enhancement model to the PSE task, improving its performance by integrating speaker information while preserving computational efficiency, and thereby offering an effective solution for real-time PSE on embedded devices.
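As a rough illustration of what "combining speaker embeddings with features" can look like, the sketch below appends one fixed speaker embedding to every frame of a feature sequence before it would be passed to an encoder. This is only an assumption about the general idea: the summary does not state the exact operation used in pDeepFilterNet2, and the function name `condition_on_speaker`, the toy feature values, and the two-dimensional embedding are all hypothetical. In practice the embedding would come from a speaker encoder such as ECAPA-TDNN.

```python
from typing import List


def condition_on_speaker(
    features: List[List[float]],
    speaker_embedding: List[float],
) -> List[List[float]]:
    """Append the same speaker embedding to every frame's feature vector.

    Hypothetical helper: concatenation along the feature axis is an
    assumption, not the operation confirmed by the paper.
    """
    if not speaker_embedding:
        raise ValueError("speaker_embedding must be non-empty")
    # The embedding is constant over time, so it is repeated for each frame.
    return [list(frame) + list(speaker_embedding) for frame in features]


# Toy input: 2 time frames with 3 features each (made-up values).
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
# Toy 2-dimensional speaker embedding (real embeddings are much larger).
embedding = [1.0, -1.0]

conditioned = condition_on_speaker(frames, embedding)
```

Each conditioned frame now carries both the acoustic features and the speaker cue, so a single encoder can process them jointly. A dual-encoder design would instead process the two inputs in separate branches.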