Abstract: Recurrent neural networks (RNNs) can predict fundamental frequency ($F_0$) for statistical parametric speech synthesis systems, given linguistic features as input. However, these models assume conditional independence between consecutive $F_0$ values, given the RNN state. In a previous study, we proposed autoregressive (AR) neural $F_0$ models to capture the causal dependency of successive $F_0$ values. In subjective evaluations, a deep AR model (DAR) outperformed an RNN. Here, we propose a Vector Quantized Variational Autoencoder (VQ-VAE) neural $F_0$ model that is both more efficient and more interpretable than the DAR. This model has two stages: one uses the VQ-VAE framework to learn a latent code for the $F_0$ contour of each linguistic unit, and the other learns to map from linguistic features to latent codes. In contrast to the DAR and RNN, which process the input linguistic features frame-by-frame, the new model converts one linguistic feature vector into one latent code for each linguistic unit. The new model achieves better objective scores than the DAR, has a smaller memory footprint, and is computationally faster. Visualization of the latent codes for phones and moras reveals that each latent code represents an $F_0$ shape for a linguistic unit.
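
The quantization step at the heart of the first stage can be illustrated with a minimal sketch. It assumes each linguistic unit has already been encoded into a small continuous vector and shows only the nearest-codebook lookup that turns that vector into one discrete latent code. The function names, the 2-dimensional embeddings, and the toy codebook values are hypothetical and not taken from the paper; in the actual model the encoder, codebook, and decoder would be learned jointly.

```python
from typing import List, Sequence


def quantize(z: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector nearest to z.

    Distance is squared Euclidean, the usual choice in VQ-VAE lookups.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))


def encode_units(
    unit_embeddings: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map one embedding per linguistic unit to one latent code index."""
    return [quantize(z, codebook) for z in unit_embeddings]


# Hypothetical toy codebook with three entries (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Hypothetical encoder outputs, one vector per linguistic unit (e.g. per mora).
unit_embeddings = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

codes = encode_units(unit_embeddings, codebook)   # one code index per unit
quantized = [codebook[k] for k in codes]          # vectors passed to a decoder
print(codes)
```

Under this sketch, the second stage reduces to predicting one code index per linguistic unit from that unit's linguistic feature vector, rather than predicting an $F_0$ value for every frame.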

Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers

Enhancing audio quality for expressive Neural Text-to-Speech

Srcodec: Split-Residual Vector Quantization for Neural Speech Codec

Neural Discrete Representation Learning

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture

ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs

VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis