Abstract: We propose a novel speaker-dependent (SD) multi-condition (MC) training approach to the joint learning of deep neural network (DNN) acoustic models and an explicit speech separation structure for recognizing multi-talker mixed speech in a single-channel setting. First, an MC acoustic modeling framework is established to train an SD-DNN model in multi-talker scenarios. Such a recognizer significantly reduces decoding complexity and improves recognition accuracy over recognizers that use speaker-independent DNN models with a complicated joint decoding structure, assuming the speaker identities in the mixed speech are known. In addition, an SD regression DNN that maps the acoustic features of mixed speech to the speech features of a target speaker is jointly trained with the SD-DNN-based acoustic models. Experimental results on the Speech Separation Challenge (SSC) small-vocabulary recognition task show that the proposed approach under multi-condition training achieves an average word error rate (WER) of 3.8%, a relative WER reduction of 65.1% from a top-performing, DNN-based, pre-processing-only approach we proposed earlier under clean-condition training (Tu et al. 2016). Furthermore, the proposed joint training DNN framework yields a relative WER reduction of 13.2% from state-of-the-art systems under multi-condition training. Finally, the effectiveness of the proposed approach is also verified on the Wall Street Journal (WSJ0) task with medium-vocabulary continuous speech recognition in a simulated multi-talker setting.

TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models

Simultaneous Speech Extraction for Multiple Target Speakers under the Meeting Scenarios

Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker

Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

Online speaker diarization of meetings guided by speech separation

Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition

Multi-channel Conversational Speaker Separation via Neural Diarization

Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

Spatial-Temporal Activity-Informed Diarization and Separation

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

A Deep Analysis of Speech Separation Guided Diarization Under Realistic Conditions

One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Spatial Diarization for Meeting Transcription with Ad-Hoc Acoustic Sensor Networks

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers