Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction
Yi Zhao,Haoyu Li,Cheng-I Lai,Jennifer Williams,Erica Cooper,Junichi Yamagishi
DOI: https://doi.org/10.48550/arXiv.2005.07884
2020-05-16
Abstract:Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful representation learning framework that can discover discrete groups of features from a speech signal without supervision. Until now, the VQ-VAE architecture has previously modeled individual types of speech features, such as only phones or only F0. This paper introduces an important extension to VQ-VAE for learning F0-related suprasegmental information simultaneously along with traditional phone <a class="link-external link-http" href="http://features.The" rel="external noopener nofollow">this http URL</a> proposed framework uses two encoders such that the F0 trajectory and speech waveform are both input to the system, therefore two separate codebooks are learned. We used a WaveRNN vocoder as the decoder component of VQ-VAE. Our speaker-independent VQ-VAE was trained with raw speech waveforms from multi-speaker Japanese speech databases. Experimental results show that the proposed extension reduces F0 distortion of reconstructed speech for all unseen test speakers, and results in significantly higher preference scores from a listening test. We additionally conducted experiments using single-speaker Mandarin speech to demonstrate advantages of our architecture in another language which relies heavily on F0.
Audio and Speech Processing,Sound