Abstract: We propose a novel speaker-dependent (SD) multi-condition (MC) training approach to joint learning of deep neural network (DNN) acoustic models and an explicit speech separation structure for recognition of multi-talker mixed speech in a single-channel setting. First, an MC acoustic modeling framework is established to train an SD-DNN model in multi-talker scenarios. Such a recognizer significantly reduces decoding complexity and improves recognition accuracy compared with recognizers that use speaker-independent DNN models with a complicated joint decoding structure, assuming the speaker identities in the mixed speech are known. In addition, an SD regression DNN that maps the acoustic features of mixed speech to the speech features of a target speaker is jointly trained with the SD-DNN based acoustic models. Experimental results on the Speech Separation Challenge (SSC) small-vocabulary recognition task show that the proposed approach under multi-condition training achieves an average word error rate (WER) of 3.8%, a relative WER reduction of 65.1% from the top-performing DNN-based pre-processing-only approach we proposed earlier under clean-condition training (Tu et al. 2016). Furthermore, the proposed joint training DNN framework yields a relative WER reduction of 13.2% from state-of-the-art systems under multi-condition training. Finally, the effectiveness of the proposed approach is also verified on the Wall Street Journal (WSJ0) task with medium-vocabulary continuous speech recognition in a simulated multi-talker setting.
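
The relative WER reductions quoted in the abstract follow the usual definition, 100 * (baseline - new) / baseline. The short Python sketch below illustrates that arithmetic only. The function names are hypothetical, and the baseline WER it back-computes is merely implied by the rounded figures in the abstract (3.8% and 65.1%); it is not a number reported in the text above.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def implied_baseline_wer(new_wer: float, relative_reduction_pct: float) -> float:
    """Invert the formula above to get the baseline WER implied by a reduction."""
    if not 0.0 <= relative_reduction_pct < 100.0:
        raise ValueError("relative_reduction_pct must be in [0, 100)")
    return new_wer / (1.0 - relative_reduction_pct / 100.0)


if __name__ == "__main__":
    # Figures taken from the abstract: 3.8% WER, 65.1% relative reduction.
    # The result is an approximation, since both inputs are rounded.
    baseline = implied_baseline_wer(3.8, 65.1)
    print(f"implied clean-condition baseline WER: about {baseline:.1f}%")
    print(f"round-trip reduction: {relative_wer_reduction(baseline, 3.8):.1f}%")
```

Read this way, a 65.1% relative reduction to 3.8% corresponds to a clean-condition baseline WER of roughly 11%, under the assumption that both figures refer to the same SSC average.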

Probabilistic Speaker-Class Based Acoustic Modeling for Large Vocabulary Continuous Speech Recognition

Probabilistic Latent Speaker Training for Large Vocabulary Speech Recognition

Probabilistic Latent Speaker Analysis for Large Vocabulary Speech Recognition

PHMM Based Asynchronous Acoustic Model for Chinese Large Vocabulary Continuous Speech Recognition

Hidden Markov Acoustic Modeling with Bootstrap and Restructuring for Low-Resourced Languages

Semi-continuous Segmental Probability Modeling for Continuous Speech Recognition

From Linear Prediction HMM to a New Combined Model for Speech Recognition

Speaker recognition using continuous density support vector machines

Modeling Speaker Variability Using Long Short-Term Memory Networks For Speech Recognition

Discriminative training of GMM-HMM acoustic model by RPCL type Bayesian Ying-Yang harmony learning

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

Replacing Uncertainty Decoding with Subband Re-Estimation for Large Vocabulary Speech Recognition in Noise

Research on Context-Dependent Acoustical Unit (Triphone) for Mandarin Continuous Speech Recognition

VB-HMM Speaker Diarization with Enhanced and Refined Segment Representation

A Study of Bootstrapping with Multiple Acoustic Features for Improved Automatic Speech Recognition

A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition

Lightly supervised acoustic model training for mandarin continuous speech recognition

Stereo-based Stochastic Mapping with Context Using Probabilistic PCA for Noise Robust Automatic Speech Recognition

Context Dependent Syllable Acoustic Model For Continuous Chinese Speech Recognition

Research on Inter-Syllable Context-Dependent Acoustic Unit for Mandarin Continuous Speech Recognition

Discriminative Dynamic Gaussian Mixture Selection with Enhanced Robustness and Performance for Multi-Accent Speech Recognition