Abstract: Recurrent neural networks (RNNs) can predict fundamental frequency ($F_0$) for statistical parametric speech synthesis systems, given linguistic features as input. However, these models assume conditional independence between consecutive $F_0$ values, given the RNN state. In a previous study, we proposed autoregressive (AR) neural $F_0$ models to capture the causal dependency of successive $F_0$ values. In subjective evaluations, a deep AR model (DAR) outperformed an RNN. Here, we propose a Vector Quantized Variational Autoencoder (VQ-VAE) neural $F_0$ model that is both more efficient and more interpretable than the DAR. This model has two stages: one uses the VQ-VAE framework to learn a latent code for the $F_0$ contour of each linguistic unit, and the other learns to map from linguistic features to latent codes. In contrast to the DAR and RNN, which process the input linguistic features frame-by-frame, the new model converts one linguistic feature vector into one latent code for each linguistic unit. The new model achieves better objective scores than the DAR, has a smaller memory footprint, and is computationally faster. Visualization of the latent codes for phones and moras reveals that each latent code represents an $F_0$ shape for a linguistic unit.
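
The quantization step in the first stage can be illustrated with a minimal sketch. The abstract does not state how a latent code is selected, so the nearest-neighbour lookup below is an assumption based on the standard VQ-VAE formulation; the names `nearest_code` and `quantize_units`, the toy codebook, and the 2-dimensional latents are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def nearest_code(latent: Sequence[float],
                 codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to `latent`.

    Closeness is squared Euclidean distance (assumed, as in standard VQ-VAE).
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))


def quantize_units(latents: Sequence[Sequence[float]],
                   codebook: Sequence[Sequence[float]]
                   ) -> Tuple[List[int], List[List[float]]]:
    """Assign one code per linguistic unit (e.g. one per phone or mora).

    `latents` holds one encoder output vector per unit, not per frame.
    Returns the chosen code indices and the corresponding code vectors.
    """
    indices = [nearest_code(z, codebook) for z in latents]
    quantized = [list(codebook[k]) for k in indices]
    return indices, quantized


# Toy data (hypothetical): 4 codes, 3 linguistic units, 2-dimensional latents.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 0.8], [0.2, 0.7]]

indices, quantized = quantize_units(latents, codebook)
print(indices)    # [0, 3, 2]
print(quantized)  # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

Under this assumption, the second stage reduces to predicting one index from `indices` per linguistic unit given that unit's linguistic feature vector, rather than predicting an $F_0$ value for every frame.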

Discourse-Level Prosody Modeling with a Variational Autoencoder for Non-Autoregressive Expressive Speech Synthesis

Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder

PE-wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis

Cross-Utterance Conditioned VAE for Speech Generation

Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder

A Variational Prosody Model for Mapping the Context-Sensitive Variation of Functional Prosodic Prototypes

VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis

StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech

A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis

Unsupervised Speech Representation Learning Using WaveNet Autoencoders

Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech

Unsupervised Speech Enhancement using Dynamical Variational Auto-Encoders

Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Ensemble prosody prediction for expressive speech synthesis