Abstract: Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, a SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performances of DeepFilterNet2 while preserving minimal computational overhead.

What problem does this paper attempt to address?

This paper addresses Personalized Speech Enhancement (PSE): extracting a target speaker's voice from recordings that contain multiple speakers and background noise. Traditional deep neural network (DNN) frameworks perform well at speech enhancement, but they often demand substantial computational resources, which makes them unsuitable for resource-constrained embedded devices. The researchers introduce an approach that personalizes the lightweight two-stage speech enhancement model DeepFilterNet2 by integrating speaker information.

PSE uses speaker voice information obtained in advance to extract that specific speaker's voice. The authors use a speaker encoder based on ECAPA-TDNN to obtain this information and feed it to the DeepFilterNet2 enhancement model as a cue for identifying the target speech. The paper examines different positions for integrating the speaker embeddings within the two-stage enhancement architecture, as well as a training strategy tailored to the PSE task. The resulting personalized model, pDeepFilterNet2, integrates speaker information through one of two encoder structures: a unified encoder or a dual encoder.

Experimental results show that the personalized models significantly outperform the non-personalized DeepFilterNet2, especially in the presence of interfering speakers. The unified encoder structure performs best, which suggests that combining the speaker embeddings with the features makes more effective use of the embedding information. Although personalization increases computational complexity, the personalized models remain relatively low in complexity overall, making them suitable for real-time applications on embedded devices. Results on the blind test set show that pDeepFilterNet2 does not match the larger model TEA-PSE 3.0 in certain cases, but it is considerably more lightweight and can achieve real-time PSE on low-resource devices.

In summary, the goal of the paper is to adapt a lightweight speech enhancement model to the PSE task, improving its performance by integrating speaker information while maintaining computational efficiency, and so to provide an effective solution for real-time PSE on embedded devices.
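As a rough illustration of what combining speaker embeddings with features can look like, the sketch below appends one fixed speaker embedding to every frame's feature vector before the frames would reach an encoder. This is an assumption made for illustration, not the paper's actual integration scheme: the function name `condition_on_speaker`, the toy dimensions, and the choice of plain concatenation are all hypothetical.

```python
from typing import List

Frame = List[float]


def condition_on_speaker(frames: List[Frame], speaker_embedding: List[float]) -> List[Frame]:
    """Append the same speaker embedding to every frame's feature vector.

    Hypothetical helper: plain concatenation is assumed here only to show
    the idea of giving an encoder both the features and the speaker cue.
    """
    if not speaker_embedding:
        raise ValueError("speaker_embedding must not be empty")
    # List concatenation builds a new list, so the input frames are untouched.
    return [frame + speaker_embedding for frame in frames]


# Toy data: 2 frames with 3 features each, and a 2-dimensional embedding.
# Real spectral features and speaker embeddings would be much larger.
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
speaker_embedding = [1.0, -1.0]

conditioned = condition_on_speaker(frames, speaker_embedding)
for frame in conditioned:
    print(len(frame), frame)
```

Each conditioned frame carries the same speaker cue, so a downstream encoder sees the target-speaker information at every time step. A real system would operate on tensors in a deep learning framework rather than on Python lists.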

A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2

Personalized Speech Enhancement Without a Separate Speaker Embedding Model

DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement

TEA-PSE 2.0: Sub-Band Network for Real-Time Personalized Speech Enhancement

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Modulating State Space Model with SlowFast Framework for Compute-Efficient Ultra Low-Latency Speech Enhancement

Towards Environmental Preference Based Speech Enhancement For Individualised Multi-Modal Hearing Aids

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

LiSenNet: Lightweight Sub-band and Dual-Path Modeling for Real-Time Speech Enhancement

DeepSpectrumLite: A Power-Efficient Transfer Learning Framework for Embedded Speech and Audio Processing From Decentralized Data

The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement

Zero-shot test-time adaptation via knowledge distillation for personalized speech denoising and dereverberation

Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement

Taylor, Can You Hear Me Now? A Taylor-Unfolding Framework for Monaural Speech Enhancement

A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech

PF-Net: Personalized Filter for Speaker Recognition from Raw Waveform

Guided Speech Enhancement Network

Efficient Personalized Speech Enhancement through Self-Supervised Learning

On real-time multi-stage speech enhancement systems

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

The NPU-Elevoc Personalized Speech Enhancement System for ICASSP2023 DNS Challenge