SIR-Progressive Audio-Visual TF-Gridnet with ASR-Aware Selector for Target Speaker Extraction in MISP 2023 Challenge

Zhongshu Hou, Tianchi Sun, Yuxiang Hu, Changbao Zhu, Kai Chen, Jing Lu
DOI: https://doi.org/10.1109/icasspw62465.2024.10626417
2024-01-01
Abstract: TF-GridNet has demonstrated its effectiveness in speech separation and enhancement. In this paper, we extend its capabilities for progressive audio-visual speech enhancement by introducing an attention-based audio-visual fusion module and a progressive learning strategy based on the signal-to-interference ratio (SIR). The model is integrated with a prior guided source separation (GSS) process for robust target speech extraction. A subsequent automatic speech recognition (ASR)-aware selector is employed to choose the enhancement output for better ASR performance. The proposed system achieves a final character error rate (CER) of 33.18% on the evaluation set and ranks first in the ICASSP 2024 Signal Processing Grand Challenge: Multimodal Information based Speech Processing (MISP) 2023 Challenge.