Implicit Enhancement of Target Speaker in Speaker-Adaptive ASR Through Efficient Joint Optimization

Minghui Wu, Haitao Tang, Jiahuan Fan, Ruoyu Wang, Hang Chen, Yanyong Zhang, Jun Du, Hengshun Zhou, Lei Sun, Xin Fang, Tian Gao, Genshun Wan, Jia Pan, Jianqing Gao
DOI: https://doi.org/10.1109/icassp48485.2024.10446845
2024-01-01
Abstract: In multi-speaker scenarios, automatic speech recognition (ASR) models rely on audio that has been pre-processed by speaker separation. However, when the target speaker is not accurately separated, ASR models cannot reach their peak performance. To address this issue, we propose a speaker-adaptive ASR framework that gains stronger implicit target-speaker enhancement capability by efficiently and jointly optimizing the speaker recognition (SR) and ASR models. Our framework introduces shared self-supervised learning representations, optimization transfer, and hierarchical speaker-gated attention. In this manner, it maximizes the effectiveness of the embedding bias and emphasizes the target speaker corresponding to semantic units. In the CHiME-7 DASR sub-track, the proposed method achieves a 28.19% relative reduction in word error rate (WER) on the development sets compared to the official baseline. Notably, this framework was also employed in the champion system for CHiME-7 DASR.
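For readers unfamiliar with the metric, a relative WER reduction is conventionally the drop in WER divided by the baseline WER. The sketch below illustrates that arithmetic only. The function name `relative_wer_reduction` and the input values are hypothetical and are not taken from the paper, which reports only the resulting percentage in this abstract.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction, in percent, of a system over a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical WER values for illustration only; NOT results from the paper.
print(round(relative_wer_reduction(40.0, 30.0), 2))  # prints 25.0
```

Under this convention, a relative reduction of 28.19% means the proposed system's WER is about 71.81% of the baseline's WER on the same data.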