Improving Speech Enhancement Using Audio Tagging Knowledge from Pre-Trained Representations and Multi-Task Learning

Shaoxiong Lin, Chao Zhang, Yanmin Qian
DOI: https://doi.org/10.1109/asru57964.2023.10389687
2023-01-01
Abstract: In deep-learning-based speech enhancement (SE), an audio-knowledge-ignorant approach is often used, which estimates a denoising model to transform the noisy input speech into clean output speech without understanding the audio events that constitute the background noise. In this paper, an audio-knowledge-aware approach is proposed to improve SE, which explicitly leverages audio tagging knowledge to understand the background noise. Based on recent progress in audio pattern analysis, the audio tagging knowledge is obtained either from additional input representations extracted by pre-trained audio tagging models, or from multi-task learning with extra audio event classification or regression tasks. Experimental results based on the DNS-2020 dataset and the pre-trained Wavegram-Logmel-CNN audio tagging model show that the proposed approach leads to considerable improvements in the STOI, SDR, and SI-SNR metrics.
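The abstract names SI-SNR as one of its evaluation metrics but does not define it. As a minimal sketch, the commonly used definition of scale-invariant signal-to-noise ratio can be written as follows. The function name `si_snr` and the stabilizing constant `eps` are assumptions made for illustration, and this is not necessarily the exact evaluation code used in the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB for two equal-length 1-D signals.

    Illustrative sketch of the commonly used definition; `eps` is an
    assumed stabilizer to avoid division by zero and log of zero.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)

    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target: this is the part of the
    # estimate that is a scaled copy of the clean signal.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]

    # Everything left over is treated as residual noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(r * r for r in e_noise)
    return 10.0 * math.log10((num + eps) / (den + eps))
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by a nonzero constant leaves the score essentially unchanged, which is the property that distinguishes SI-SNR from a plain SNR. A higher value indicates that the enhanced signal is closer to the clean reference.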