Abstract: In our previous work, we introduced an attention-based speaker adaptation method, which was shown to be an efficient online speaker adaptation method for real-time speech recognition. In this paper, we present a more complete framework for this method, named memory-aware networks, which consists of the main network, the memory module, the attention module and the connection module. A gate mechanism and a multiple-connections strategy are presented to connect the memory with the main network so that the memory is fully exploited. An auxiliary speaker classification task is added to improve the accuracy of the attention module. The fixed-size ordinally forgetting encoding method is used together with average pooling to gather both short-term and long-term information. Furthermore, instead of using only traditional speaker embeddings such as i-vectors or d-vectors as the memory, we design a new form of memory called residual vectors, which can represent different pronunciation habits. Experiments on both the Switchboard and AISHELL-2 tasks show that our method performs online speaker adaptation well, with no additional adaptation data and only a 3% relative increase in decoding computational complexity. Under the cross-entropy criterion, our method achieves relative word error rate reductions of 9.4% and 8.3% over the speaker-independent model on the Switchboard and AISHELL-2 tasks, respectively, and of approximately 7.0% over the traditional d-vector-based speaker adaptation method.
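
As an illustration of how fixed-size ordinally forgetting encoding (FOFE) and average pooling can be combined into one fixed-size summary, the following is a minimal sketch and not the authors' implementation. It assumes the standard FOFE recursion z_t = alpha * z_{t-1} + x_t applied directly to frame-level feature vectors. The function names (`fofe`, `average_pool`, `summarize`), the forgetting factor value and the choice to concatenate the two summaries are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence


def fofe(frames: Sequence[Sequence[float]], alpha: float = 0.9) -> List[float]:
    """FOFE recursion z_t = alpha * z_{t-1} + x_t (alpha is an assumed value).

    Recent frames dominate the result, so it emphasises short-term context.
    """
    if not frames:
        raise ValueError("need at least one frame")
    z = [0.0] * len(frames[0])
    for x in frames:
        z = [alpha * zi + xi for zi, xi in zip(z, x)]
    return z


def average_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean over all frames: every frame counts equally (long-term context)."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def summarize(frames: Sequence[Sequence[float]], alpha: float = 0.9) -> List[float]:
    """Concatenate both summaries into one fixed-size vector.

    Concatenation is an illustrative choice, not taken from the paper.
    """
    return fofe(frames, alpha) + average_pool(frames)


if __name__ == "__main__":
    toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(summarize(toy_frames, alpha=0.5))
```

Whatever the number of input frames, the output length is twice the feature dimension, which is what makes such a summary usable as a fixed-size input to a downstream module.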

A Combined Speaker Adaptation Method for Mandarin Speech Recognition

A Speaker Adaptation Algorithm Based on Matrix Linear Interpolation

Agmma: A Novel Incremental Adaptation Method And Its Application To Speaker Recognition

MAP-based Speaker Adaptation in Speech Synthesis

Speech Recognition Using Speaker Adaptation by System Parameter Transformation

Speaker adaptation using maximum likelihood model interpolation

Speaker adaptation based on combination of MAP estimation and weighted neighbor regression

Phoneme Dependent Speaker Embedding And Model Factorization For Multi-Speaker Speech Synthesis And Adaptation

Speaker Adaptation with MAP Estimation and Weighted Neighbor Regression

Codebook-Based Speaker Adaptation

Cross-Lingual Speaker Adaptation for HMM-Based Speech Synthesis

Speaker Normalization Training and Adaptation for Speech Recognition

A New Subspace Based Speaker Adaptation Method

Label Transform Based Cross-Language Speaker Adaptation in Bilingual (Mandarin-English) TTS

Interpolation adaptation algorithm based on gaussian similarity analysis

A New Approach for Incremental Speaker Adaptation

Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition

Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis

Online Speaker Adaptation Using Memory-Aware Networks for Speech Recognition

Dynamic Speaker Selected Training for Rapid Speaker Adaptation

An Improved Cross-Language Model Adaptation Method for Speech Synthesis