Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR

Yuxiao Lin, Zhihao Du, Shiliang Zhang, Fan Yu, Zhou Zhao, Fei Wu
DOI: https://doi.org/10.1109/ISCSLP57327.2022.10037902
2022-01-01
Abstract: In this paper, we propose Separate-to-Recognize, a joint framework for the speaker-attributed automatic speech recognition (SA-ASR) task. The proposed framework combines multi-target speech separation and speech recognition modules into a single end-to-end model. It takes mixed speech utterances and target-speaker embeddings as input and predicts the separated speech and a transcription for each speaker. In the multi-target speech separation module, all speakers in the mixture are separated simultaneously, in contrast to existing single-target separation methods. Furthermore, we develop a dual-path Conformer-based separator, which improves dual-path time-domain separation by exploiting the Conformer's ability to model local relationships. We also explore different schemes for jointly training the modules and propose a training strategy that better coordinates the two modules in our model. By comparing different model structures and training strategies in experiments, we demonstrate the effectiveness of the proposed multi-target separation module and the dual-path Conformer-based separator. Experimental results also show that our framework generalizes to different neural network architectures.