Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

Li-Juan Liu, Yan-Nian Chen, Jing-Xuan Zhang, Yuan Jiang, Ya-Jun Hu, Zhen-Hua Ling, Li-Rong Dai
DOI: https://doi.org/10.21437/vcc_bc.2020-17
2020-01-01
Abstract: Although the N10 system in Voice Conversion Challenge 2018 (VCC 18) achieved excellent voice conversion results in both speech naturalness and speaker similarity, its performance is limited by several modeling insufficiencies. In this paper, we propose to overcome these limitations by introducing three modifications. First, we adopt an autoregressive conversion model to improve the capability of the conversion model; second, we use a high-fidelity WaveNet to model 24 kHz/16-bit waveforms to improve the naturalness of the converted speech; third, we propose a duration adjustment strategy to compensate for the obvious speech rate difference between source and target speakers. Experimental results show that the proposed method improves conversion performance significantly. Furthermore, we validate the performance of this system for cross-lingual voice conversion by applying it directly to the cross-lingual task in Voice Conversion Challenge 2020 (VCC 2020). The released official subjective results show that our system obtains the best performance in the naturalness of the converted speech and performance comparable to the best system in speaker similarity, which indicates that the proposed method can achieve state-of-the-art cross-lingual voice conversion performance as well.