Adaptive Permutation Invariant Training with Auxiliary Information for Monaural Multi-Talker Speech Recognition.

Xuankai Chang, Yanmin Qian, Dong Yu
DOI: https://doi.org/10.1109/icassp.2018.8461570
2018-01-01
Abstract: In this paper, we extend our previous work on direct recognition of single-channel multi-talker mixed speech using permutation invariant training (PIT). We propose to adapt the PIT models with auxiliary features such as pitch and i-vector, and to exploit gender information through multi-task learning that jointly optimizes for speech recognition and speaker-pair prediction. We also compare CNN-BLSTMs against the BLSTM-RNNs used in our previous PIT-ASR model. Experimental results on artificially mixed two-talker AMI data indicate that the proposed improvements reduce word error rate (WER) by approximately 10.0% relative to our previous work for both speakers in the mixed speech. Our results also confirm that PIT can be easily combined with advanced techniques to improve performance on multi-talker speech recognition.
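The abstract names permutation invariant training without describing the criterion. As a rough illustration only, the following is a minimal sketch of a generic utterance-level PIT loss for label sequences; it is not the authors' implementation. It assumes each output stream is a list of per-frame posterior distributions and each reference is a list of label indices. The function names `cross_entropy` and `pit_loss` are hypothetical, and the auxiliary features, multi-task objective, and network architectures mentioned above are not modeled.

```python
import itertools
import math


def cross_entropy(posteriors, labels):
    """Mean negative log-probability of the reference labels over all frames."""
    total = sum(math.log(frame[label]) for frame, label in zip(posteriors, labels))
    return -total / len(labels)


def pit_loss(outputs, references):
    """Return (minimum loss, best permutation) over stream-to-speaker assignments.

    outputs:    list of S output streams, each a list of per-frame posteriors
    references: list of S reference label sequences, one per speaker
    Assumption: the assignment is fixed for the whole utterance.
    """
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(references))):
        # Output stream s is scored against the reference of speaker perm[s].
        loss = sum(
            cross_entropy(outputs[s], references[perm[s]])
            for s in range(len(outputs))
        )
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: two streams, two frames, two classes (values are made up).
outputs = [
    [[0.9, 0.1], [0.8, 0.2]],  # stream 0 favors class 0
    [[0.2, 0.8], [0.3, 0.7]],  # stream 1 favors class 1
]
references = [[1, 1], [0, 0]]  # speaker order is swapped relative to the streams
loss, perm = pit_loss(outputs, references)
print(perm, round(loss, 4))
```

In this toy case the swapped assignment gives the lower loss, so the criterion selects it rather than penalizing the model for the arbitrary order of the reference speakers. Enumerating all permutations is cheap for the two-talker setting considered here, though the number of permutations grows factorially with the number of speakers.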