Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation

Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Rongxiu Zhong
DOI: https://doi.org/10.21437/interspeech.2020-1745
2020-01-01
Abstract: Low similarity and naturalness of synthesized speech remain a challenging problem for speaker adaptation with limited resources. Because the acoustic model is too complex to interpret, overfitting occurs when it is trained on little data. To prevent overfitting, this paper proposes a novel speaker adaptation framework that decomposes the parameter space of the end-to-end acoustic model into two parts: one predicts the spoken content and the other models the speaker's voice. The spoken content is represented by a phone posteriorgram (PPG), which is speaker independent. Adapting the two sub-modules separately effectively alleviates overfitting. Moreover, we propose two adaptation strategies, chosen according to whether the data has text annotations, so speaker adaptation can also be performed without text annotations. Experimental results confirm the adaptability of the proposed method of factorizing spoken content and voice. Listening tests demonstrate that, with just 10 sentences, the proposed method outperforms speaker adaptation conducted on Tacotron in terms of naturalness and speaker similarity.
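The factorization described in the abstract can be illustrated with a minimal Python sketch. It is one plausible reading, not the authors' implementation: the abstract does not spell out the two strategies, so the rule used here (adapt both sub-modules when a transcript exists, adapt only the voice sub-module otherwise) is an assumption, and the names `Utterance`, `FactorizedModel`, `modules_to_adapt`, and `adapt` are hypothetical. Real parameter updates are replaced by counters so that the control flow stays visible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    """One adaptation sample; `text` is None when no transcript exists."""
    audio_id: str
    text: Optional[str] = None


@dataclass
class FactorizedModel:
    """Hypothetical two-part acoustic model: text -> PPG -> acoustic features."""
    content_updates: int = 0  # adaptation steps applied to the spoken-content part
    voice_updates: int = 0    # adaptation steps applied to the voice part


def modules_to_adapt(utt: Utterance) -> List[str]:
    """Pick which sub-modules an utterance can be used to adapt.

    Assumption (not stated in the abstract): the content part maps text to
    PPG and therefore needs a transcript, while the voice part maps PPG to
    acoustics and can be adapted from audio alone.
    """
    if utt.text is not None:
        return ["content", "voice"]
    return ["voice"]


def adapt(model: FactorizedModel, data: List[Utterance]) -> FactorizedModel:
    """Apply one placeholder adaptation step per eligible sub-module."""
    for utt in data:
        for name in modules_to_adapt(utt):
            if name == "content":
                model.content_updates += 1  # stands in for a gradient step
            else:
                model.voice_updates += 1    # stands in for a gradient step
    return model


if __name__ == "__main__":
    samples = [Utterance("utt_001", "hello world"), Utterance("utt_002")]
    print(adapt(FactorizedModel(), samples))
```

Under this assumed rule, untranscribed audio still contributes to the voice sub-module, which is consistent with the abstract's claim that adaptation can be performed without text annotations.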