Abstract: Personalizing a speech synthesis system is a highly desired application, in which the system generates speech in the user’s voice from only a few enrolled recordings. Recent work takes two main approaches to building such a system: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with a few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making them hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding, and the trained TTS model synthesizes the user’s speech conditioned on that embedding. Nevertheless, the speaker encoder suffers from a generalization gap between seen and unseen speakers. In this paper, we propose applying a meta-learning algorithm to the speaker adaptation method. More specifically, we use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model, which aims to find a good meta-initialization from which the model can adapt quickly to any few-shot speaker adaptation task. As a result, the meta-trained TTS model can also be adapted to unseen speakers efficiently. Our experiments compare the proposed method (Meta-TTS) with two baselines: a speaker adaptation baseline and a speaker encoding baseline. The evaluation results show that Meta-TTS can synthesize speech with high speaker similarity from a few enrollment samples, using fewer adaptation steps than the speaker adaptation baseline, and that it outperforms the speaker encoding baseline under the same training scheme. When the baseline’s speaker encoder is pre-trained with data from an extra 8,371 speakers, Meta-TTS still outperforms the baseline on the LibriTTS dataset and achieves comparable results on the VCTK dataset.
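
The MAML idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy setting in which each "speaker" is a scalar regression task (y = slope * x) and the "model" is a single weight w. All function names and hyperparameters below are hypothetical. The inner loop adapts w on a few support samples, and the outer loop updates the shared initialization so that one adaptation step works well on the query samples.

```python
import random

def loss_and_grad(w, xs, ys):
    """Mean squared error of the model y = w * x, and its derivative w.r.t. w."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad

def curvature(xs):
    """Second derivative of the MSE loss w.r.t. w (constant for this model)."""
    return sum(2 * x * x for x in xs) / len(xs)

def sample_task(rng, k=5):
    """Toy stand-in for a speaker: a random slope with k support and k query points."""
    slope = rng.uniform(1.0, 3.0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(2 * k)]
    ys = [slope * x for x in xs]
    return (xs[:k], ys[:k]), (xs[k:], ys[k:])

def adapt(w, support, inner_lr, steps=1):
    """Inner loop: a few gradient steps on the support (enrollment) samples."""
    xs, ys = support
    for _ in range(steps):
        _, g = loss_and_grad(w, xs, ys)
        w = w - inner_lr * g
    return w

def maml_train(meta_steps=2000, inner_lr=0.1, outer_lr=0.01, seed=0):
    """Outer loop: learn an initialization that adapts well in one inner step."""
    rng = random.Random(seed)
    w = 0.0  # meta-initialization being learned
    for _ in range(meta_steps):
        support, query = sample_task(rng)
        w_adapted = adapt(w, support, inner_lr)
        _, g_query = loss_and_grad(w_adapted, *query)
        # Chain rule through the single inner step:
        # d(w_adapted)/dw = 1 - inner_lr * (second derivative of the support loss).
        meta_grad = (1.0 - inner_lr * curvature(support[0])) * g_query
        w -= outer_lr * meta_grad
    return w
```

Calling `maml_train()` returns the learned initialization, and `adapt(w, support, 0.1)` then performs the few-shot adaptation for a new task. In a real TTS model the second-order term would come from automatic differentiation rather than a closed-form curvature.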

Bridging Mixture Density Networks with Meta-Learning for Automatic Speaker Identification

Designing Neural Speaker Embeddings with Meta Learning

Few-shot short utterance speaker verification using meta-learning

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

An HMM/MFNN Hybrid Architecture Based on Stacked Generalization for Speaker Identification

Multilingual Meta-Transfer Learning for Low-Resource Speech Recognition

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Meta learning based audio tagging

A Novel Discriminant Locality Preserving Projections for MDM-based Speaker Classification

SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR

Adapting Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings

MetaRL-SE: a few-shot speech enhancement method based on meta-reinforcement learning

Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets

Meta-Learning Empowered Meta-Face: Personalized Speaking Style Adaptation for Audio-Driven 3D Talking Face Animation

Improved Meta-Learning Training for Speaker Verification

Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning Using Acoustic Tokens Discovered from Unlabeled Data

Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Sample

BSML: Bidirectional Sampling Aggregation-based Metric Learning for Low-resource Uyghur Few-shot Speaker Verification

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Investigation Of Bottleneck Features And Multilingual Deep Neural Networks For Speaker Verification

Multi-task Metric Learning for Text-independent Speaker Verification