Abstract: In acoustic modeling, speaker adaptive training (SAT) has been a long-standing technique for traditional Gaussian mixture models (GMMs). Acoustic models trained with SAT become independent of the training speakers and generalize better to unseen testing speakers. This paper ports the idea of SAT to deep neural networks (DNNs) and proposes a framework to perform feature-space SAT for DNNs. Using i-vectors as speaker representations, our framework learns an adaptation neural network to derive speaker-normalized features. Speaker adaptive models are obtained by fine-tuning DNNs in this feature space. The framework can be applied to various feature types and network structures, which makes it a general SAT solution. In this paper, we investigate how to build SAT-DNN models effectively and efficiently. First, we study the optimal configurations of SAT-DNNs for large-scale acoustic modeling tasks. Then, after presenting detailed comparisons between SAT-DNNs and existing DNN adaptation methods, we propose to combine SAT-DNNs with model-space DNN adaptation during decoding. Finally, to accelerate the learning of SAT-DNNs, we employ a simple yet effective strategy, frame skipping, to reduce the size of the training data. Our experiments show that, compared with a strong DNN baseline, the SAT-DNN model achieves 13.5% and 17.5% relative improvements in word error rate (WER), without and with model-space adaptation applied, respectively. Data reduction based on frame skipping yields a 2× speed-up of SAT-DNN training while causing negligible WER loss on the testing data.
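The two mechanisms named in the abstract, i-vector-driven feature normalization and frame skipping, can be illustrated with a minimal NumPy sketch. The abstract does not specify how the adaptation network's output is combined with the input features, so the additive per-speaker shift used here is an assumption, as are the single tanh hidden layer, all dimensions, the function names, and the skipping step of 2.

```python
import numpy as np

def adaptation_shift(ivector, W1, b1, W2, b2):
    """Hypothetical adaptation network: maps a speaker i-vector to a feature-space shift."""
    hidden = np.tanh(W1 @ ivector + b1)
    return W2 @ hidden + b2

def speaker_normalize(features, ivector, params):
    """Apply the same speaker-dependent shift to every frame (assumed additive form)."""
    shift = adaptation_shift(ivector, *params)
    return features + shift  # shift of shape (feat_dim,) broadcasts over frames

def skip_frames(features, labels, step=2):
    """Frame skipping: keep every `step`-th frame and its label to shrink the training set."""
    return features[::step], labels[::step]

# Toy data; all sizes are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(0)
feat_dim, ivec_dim, hidden_dim, num_frames = 40, 100, 64, 10
params = (
    rng.standard_normal((hidden_dim, ivec_dim)) * 0.01, np.zeros(hidden_dim),
    rng.standard_normal((feat_dim, hidden_dim)) * 0.01, np.zeros(feat_dim),
)
features = rng.standard_normal((num_frames, feat_dim))  # one utterance, one speaker
labels = np.arange(num_frames)                          # placeholder frame labels
ivector = rng.standard_normal(ivec_dim)                 # placeholder speaker i-vector

normalized = speaker_normalize(features, ivector, params)
kept_features, kept_labels = skip_frames(normalized, labels, step=2)
```

In a full system the adaptation network would be trained jointly with, or alongside, the DNN acoustic model, and the DNN would then be fine-tuned on the normalized features; this sketch only shows the data flow.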

Embedding-Based Speaker Adaptive Training of Deep Neural Networks

Speaker adaptive training of deep neural network acoustic models using i-vectors

Speaker Cluster-Based Speaker Adaptive Training for Deep Neural Network Acoustic Modeling

An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

Analyzing deep CNN-based utterance embeddings for acoustic model adaptation

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

Deep Speaker: an End-to-End Neural Speaker Embedding System

Adapting End-to-End Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

VAE-based Domain Adaptation for Speaker Verification

An Adaptive X-Vector Model for Text-Independent Speaker Verification

Unsupervised Speaker Adaptation Of Deep Neural Network Based On The Combination Of Speaker Codes And Singular Value Decomposition For Speech Recognition

Deep Neural Network Embeddings with Gating Mechanisms for Text-Independent Speaker Verification

Online Speaker Adaptation for LVCSR Based on Attention Mechanism

Speaker Adaptation and Adaptive Training for Jointly Optimised Tandem Systems

Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models

Introducing Phonetic Information to Speaker Embedding for Speaker Verification

Investigation of Speaker-adaptation methods in Transformer based ASR

DSARSR: Deep Stacked Auto-encoders Enhanced Robust Speaker Recognition

Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings

Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition