Towards Fine-Grained Prosody Control for Voice Conversion

Zheng Lian, Rongxiu Zhong, Zhengqi Wen, Bin Liu, Jianhua Tao
DOI: https://doi.org/10.1109/iscslp49672.2021.9362110
2021-01-24
Abstract: In a typical voice conversion system, previous works used various acoustic features of the source speech (such as pitch, the voiced/unvoiced flag, and aperiodicity) to control the prosody of the converted speech. However, prosody is related to many factors, such as intonation, stress, and rhythm, and it is challenging to describe it fully through hand-crafted acoustic features. To address this difficulty, we propose to describe prosody with prosody embeddings, which are learned from the source speech in an unsupervised manner. To verify the effectiveness of the proposed method, we conduct experiments on our Mandarin corpus. Experimental results show that the proposed method improves the speech quality and speaker similarity of the converted speech. Moreover, we observe that our method achieves promising results even in singing conditions.