Bi-Level Speaker Supervision for One-Shot Speech Synthesis

Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Chunyu Qiang
DOI: https://doi.org/10.21437/interspeech.2020-1737
2020-01-01
Abstract: The gap between the speaker characteristics of reference speech and those of synthesized speech remains a challenging problem in one-shot speech synthesis. In this paper, we propose a bi-level speaker supervision framework that closes this gap by supervising the synthesized speech at both the speaker feature level and the speaker identity level. Speaker feature extraction and speaker identity reconstruction are integrated into an end-to-end speech synthesis network: the former, at the speaker feature level, narrows the gap in speaker characteristics, and the latter, at the speaker identity level, preserves identity information. The framework ensures that the synthesized speech has speaker characteristics similar to those of the original speech and that different speakers remain distinguishable. Additionally, to reduce the influence of speech content on speaker feature extraction, we propose a text-independent reference encoder (ti-reference encoder) module to extract the speaker feature. Experiments on the LibriTTS dataset show that our model is able to generate speech similar to that of the target speaker. Furthermore, we demonstrate that the model learns meaningful speaker representations through bi-level speaker supervision and the ti-reference encoder module.
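The two supervision levels described in the abstract can be pictured as two terms of a combined training loss. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the loss functions, so the mean squared error at the feature level, the cross-entropy at the identity level, the weights, and all function and parameter names here are hypothetical choices made for illustration.

```python
import math


def feature_level_loss(ref_embedding, syn_embedding):
    """Mean squared error between the speaker embedding of the reference
    speech and the one extracted from the synthesized speech.
    (MSE is an assumed choice; the abstract does not name a distance.)"""
    if len(ref_embedding) != len(syn_embedding):
        raise ValueError("embeddings must have the same dimension")
    squared = [(r - s) ** 2 for r, s in zip(ref_embedding, syn_embedding)]
    return sum(squared) / len(squared)


def identity_level_loss(speaker_logits, speaker_id):
    """Cross-entropy of speaker-classifier logits (computed on synthesized
    speech) against the true speaker index.
    (A classifier with cross-entropy is an assumed choice.)"""
    peak = max(speaker_logits)
    # Numerically stable log-sum-exp over the logits.
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in speaker_logits))
    return log_sum_exp - speaker_logits[speaker_id]


def bi_level_loss(ref_embedding, syn_embedding, speaker_logits, speaker_id,
                  feature_weight=1.0, identity_weight=1.0):
    """Weighted sum of the two supervision terms (weights are hypothetical)."""
    return (feature_weight * feature_level_loss(ref_embedding, syn_embedding)
            + identity_weight * identity_level_loss(speaker_logits, speaker_id))


if __name__ == "__main__":
    ref = [0.2, -0.1, 0.5]         # embedding of the reference utterance
    syn = [0.1, -0.1, 0.4]         # embedding of the synthesized utterance
    logits = [2.0, 0.5, -1.0]      # classifier scores for three speakers
    print(bi_level_loss(ref, syn, logits, speaker_id=0))
```

In this sketch the feature-level term pulls the synthesized embedding toward the reference embedding, while the identity-level term penalizes synthesized speech that a speaker classifier would assign to the wrong speaker, which corresponds to the distinguishability between speakers that the abstract mentions.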