A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams

Hui Lu, Zhiyong Wu, Runnan Li, Shiyin Kang, Jia Jia, Helen Meng
DOI: https://doi.org/10.1109/icassp.2019.8682938
2019-01-01
Abstract: Voice conversion can benefit from a WaveNet vocoder, which improves the naturalness and quality of converted speech. However, current approaches train the conversion module and the WaveNet vocoder separately, toward different optimization objectives, which can make model tuning and coordination difficult. In this paper, we propose a compact framework that unifies the conversion and vocoder parts. A multi-head self-attention structure and a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) encode speaker-independent phonetic posteriorgrams (PPGs) into an intermediate representation, which serves as the conditioning input of WaveNet to generate the target speaker's waveform. In this way, the conversion and vocoder parts form a single compact system in which all parameters can be tuned simultaneously for global optimization. We compare the proposed method with a baseline system that consists of a separately trained conversion module and WaveNet vocoder. Subjective evaluations show that the proposed method achieves better results in both naturalness and speaker similarity.
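
As a rough illustration of the encoding step described in the abstract, the sketch below applies multi-head self-attention to a sequence of PPG frames and produces a frame-level representation of the kind that could condition a vocoder. It is only a sketch under stated assumptions: the abstract does not specify the attention formulation or any dimensions, so the scaled dot-product form, the sizes (`T`, `D_IN`, `D_MODEL`, `HEADS`), the random weights, and the function name `multi_head_self_attention` are all hypothetical. The BLSTM layer and the WaveNet itself are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(ppg, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over PPG frames (assumed form).

    ppg: (T, D_in) frame-level posteriorgrams.
    w_q, w_k, w_v: (D_in, D_model) projections; w_o: (D_model, D_model).
    Returns the encoded frames (T, D_model) and attention weights (heads, T, T).
    """
    num_frames = ppg.shape[0]
    d_model = w_q.shape[1]
    assert d_model % num_heads == 0, "D_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(x):
        # (T, D_model) -> (heads, T, d_head)
        return x.reshape(num_frames, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(ppg @ w_q)
    k = split_heads(ppg @ w_k)
    v = split_heads(ppg @ w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T, T)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    context = weights @ v                                # (heads, T, d_head)

    # Concatenate the heads again and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(num_frames, d_model)
    return merged @ w_o, weights

# Hypothetical sizes: 50 frames, 40 phonetic classes, 64-dim model, 4 heads.
T, D_IN, D_MODEL, HEADS = 50, 40, 64, 4
rng = np.random.default_rng(0)

# Stand-in PPGs: each frame is a probability distribution over phonetic classes.
ppg = softmax(rng.normal(size=(T, D_IN)), axis=-1)

# Random weights stand in for parameters that would be learned jointly.
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(D_IN, D_MODEL)) for _ in range(3))
w_o = rng.normal(scale=0.1, size=(D_MODEL, D_MODEL))

cond, attn = multi_head_self_attention(ppg, w_q, w_k, w_v, w_o, HEADS)
print(cond.shape, attn.shape)
```

In the unified framework the abstract describes, a representation like `cond` would pass through the BLSTM and then into WaveNet as conditioning, with the encoder and vocoder parameters trained together rather than separately.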