An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge

Runduo Han, Xiaopeng Yan, Weiming Xu, Pengcheng Guo, Jiayao Sun, He Wang, Quan Lu, Ning Jiang, Lei Xie
2024-03-07
Abstract: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. Specifically, our approach adopts different extraction strategies based on the audio quality, striking a balance between interference removal and speech preservation, which benefits the back-end automatic speech recognition (ASR) systems. Experiments show that our approach achieves a character error rate (CER) of 24.2% and 33.2% on the Dev and Eval sets, respectively, obtaining second place in the challenge.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the audio-visual target speaker extraction (AVTSE) task of the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. The goal is to extract the speech of a specific speaker in complex acoustic environments that contain background noise and interference from other speakers. Traditional methods rely on pre-recorded audio of the target speaker, which limits their practicality, so the MISP 2023 Challenge provides lip motion video as prior information instead. The authors propose a multi-strategy approach driven by audio quality. They first divide the audio into three quality groups (high, medium, and low) according to the DNSMOS OVRL score, then apply a different extraction technique to each group. High-quality audio is processed with Guided Source Separation (GSS) alone. For medium-quality audio, the GSS output is combined with the lip motion video through a multi-channel fusion method for further extraction. Low-quality audio is denoised with the DRC-NET network. This approach achieves a character error rate (CER) of 24.2% on the development set and 33.2% on the evaluation set, placing second in the challenge. The results also indicate that matching the extraction strategy to the audio quality, which balances interference removal against speech preservation, is important for the performance of the back-end automatic speech recognition (ASR) system.
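The three-way routing described above can be sketched in a few lines of Python. This is only an illustration of the dispatch logic, not the authors' implementation: the threshold values `LOW_THRESHOLD` and `HIGH_THRESHOLD`, the function names, and the strategy labels are all hypothetical, since the summary does not state the actual DNSMOS OVRL cut-offs, and the score is assumed to have been computed elsewhere.

```python
# Minimal sketch of quality-based strategy selection.
# ASSUMPTION: the cut-offs below are placeholders, not values from the paper.
LOW_THRESHOLD = 2.0   # hypothetical boundary between low and medium quality
HIGH_THRESHOLD = 3.0  # hypothetical boundary between medium and high quality

# Hypothetical labels for the three processing paths named in the summary.
STRATEGY_BY_GROUP = {
    "high": "gss",                 # GSS output used directly
    "medium": "gss_plus_avtse",    # GSS output fused with lip motion video
    "low": "drc_net",              # DRC-NET noise reduction
}


def quality_group(ovrl_score: float,
                  low: float = LOW_THRESHOLD,
                  high: float = HIGH_THRESHOLD) -> str:
    """Map a precomputed DNSMOS OVRL score to a quality group."""
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if ovrl_score >= high:
        return "high"
    if ovrl_score >= low:
        return "medium"
    return "low"


def select_strategy(ovrl_score: float) -> str:
    """Return the label of the extraction strategy for one utterance."""
    return STRATEGY_BY_GROUP[quality_group(ovrl_score)]


if __name__ == "__main__":
    for score in (3.4, 2.5, 1.2):
        print(score, select_strategy(score))
```

In a real pipeline each label would map to a call into the corresponding model; keeping the routing as a pure function of the score makes the thresholds easy to tune on a development set.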