Monaural Multi-Talker Speech Recognition With Attention Mechanism And Gated Convolutional Networks

Xuankai Chang, Yanmin Qian, Dong Yu
DOI: https://doi.org/10.21437/Interspeech.2018-1547
2018-01-01
Abstract: To improve speech recognition accuracy in the multi-talker scenario, we propose a novel model architecture that incorporates the attention mechanism and a gated convolutional network (GCN) into our previously developed permutation invariant training based multi-talker speech recognition system (PIT-ASR). The new architecture has three components: an encoding transformer, an attention module, and a frame-level senone predictor. The encoding transformer first transforms a mixed speech sequence into a sequence of embedding vectors. The attention mechanism then extracts individual context vectors from this embedding sequence for the different speaker sources. Finally, the predictor generates the senone posteriors for all speaker sources independently, using the knowledge from the context vectors. To obtain better embedding representations, we explore gated convolutional networks in the encoding transformer. Experimental results on the artificially mixed two-talker WSJ0 corpus show that the proposed model reduces the word error rate (WER) by more than 15% relative to our previous PIT-ASR system.
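The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the commonly used gated-convolution form (a linear convolution multiplied elementwise by a sigmoid-gated convolution) and the usual utterance-level PIT criterion (the summed frame-level cross-entropy, minimized over all assignments of output streams to reference speakers). The function names `gated_conv1d` and `pit_loss`, the scalar input sequence, and the toy posteriors are all hypothetical and chosen only for illustration.

```python
import math
from itertools import permutations


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_conv1d(seq, w_lin, w_gate):
    """Gated 1-D convolution over a scalar sequence (no padding).

    out[t] = (w_lin . window_t) * sigmoid(w_gate . window_t)
    Both kernels are assumed to have the same length.
    """
    k = len(w_lin)
    out = []
    for t in range(len(seq) - k + 1):
        window = seq[t:t + k]
        lin = sum(w * x for w, x in zip(w_lin, window))
        gate = sum(w * x for w, x in zip(w_gate, window))
        out.append(lin * sigmoid(gate))
    return out


def cross_entropy(posteriors, labels):
    """Summed negative log-likelihood of the reference senone labels."""
    return -sum(math.log(frame[label]) for frame, label in zip(posteriors, labels))


def pit_loss(outputs, targets):
    """Utterance-level PIT: try every output-to-speaker assignment, keep the best.

    outputs: one list of per-frame posterior vectors per output stream.
    targets: one list of per-frame senone labels per reference speaker.
    Returns (loss, perm), where perm[s] is the speaker assigned to stream s.
    """
    best = None
    for perm in permutations(range(len(targets))):
        loss = sum(cross_entropy(outputs[s], targets[perm[s]])
                   for s in range(len(outputs)))
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best


# Toy example (hypothetical numbers): two output streams, two frames, two senones.
outputs = [
    [[0.9, 0.1], [0.8, 0.2]],  # stream 0 favours senone 0
    [[0.2, 0.8], [0.3, 0.7]],  # stream 1 favours senone 1
]
targets = [[1, 1], [0, 0]]     # speaker 0 says senone 1, speaker 1 says senone 0
loss, perm = pit_loss(outputs, targets)
gated = gated_conv1d([1.0, 2.0, 3.0], [1.0, 1.0], [0.0, 0.0])
```

In the toy example the reference labels are listed in the opposite order to the output streams, so the lower-loss assignment is the swapped one; choosing the assignment per utterance is what keeps training independent of speaker order. With an all-zero gate kernel the sigmoid evaluates to 0.5, so the gated output is simply half of the linear convolution.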