Abstract: This paper presents a new spectral modeling method for statistical parametric speech synthesis. Conventional hidden Markov model (HMM)-based parametric speech synthesis adopts high-level spectral parameters, such as mel-cepstra or line spectral pairs, as its features. The method proposed in this paper improves on the conventional approach in two ways. First, distributions of low-level, untransformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of a single Gaussian distribution, we adopt graphical models with multiple hidden variables, including restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), to represent the distribution of the low-level spectral envelopes at each HMM state. At synthesis time, the spectral envelopes are predicted from the RBM-HMMs or DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion under the constraints of the dynamic features. A Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state in order to obtain a closed-form solution to the parameter generation problem. Our experimental results show that both the RBM-HMM and the DBN-HMM generate better spectral envelope parameter sequences than the conventional Gaussian-HMM and generalize better, and that the DBN-HMM and the RBM-HMM perform similarly, possibly because of the Gaussian approximation. As a result, the proposed method significantly alleviates the over-smoothing effect and improves the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.
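The closed-form generation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that, once each state's distribution has been approximated by a Gaussian, generation reduces to the standard weighted least-squares problem used in HMM-based synthesis: find the static trajectory c whose static and delta features best match the per-frame means. The one-dimensional feature, the delta window 0.5 * (c[t+1] - c[t-1]), the diagonal variances, and the function names `build_window_matrix` and `generate_trajectory` are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def build_window_matrix(T):
    """Return a (2T, T) matrix W such that W @ c interleaves static and delta features."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0  # static feature: c[t]
        # Assumed delta window: 0.5 * (c[t+1] - c[t-1]), indices clamped at the edges.
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
        W[2 * t + 1, max(t - 1, 0)] -= 0.5
    return W


def generate_trajectory(mean, var):
    """Solve (W' P W) c = W' P mu for the static trajectory c.

    mean, var: arrays of shape (T, 2) holding the per-frame Gaussian mean and
    variance of the [static, delta] features (hypothetical inputs).
    """
    T = mean.shape[0]
    W = build_window_matrix(T)
    mu = mean.reshape(-1)           # interleaved [static_0, delta_0, static_1, ...]
    prec = 1.0 / var.reshape(-1)    # diagonal precision matrix P
    A = W.T @ (prec[:, None] * W)   # W' P W
    b = W.T @ (prec * mu)           # W' P mu
    return np.linalg.solve(A, b)
```

If the per-frame means are exactly consistent with some trajectory, that trajectory is recovered; otherwise the delta terms trade off against the static terms according to their variances, which is what smooths the generated sequence.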

Lip Movement Generation Using Restricted Boltzmann Machines For Visual Speech Synthesis

Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis

Lip Movements Generation at a Glance

Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Modeling Spectral Envelopes Using Deep Conditional Restricted Boltzmann Machines for Statistical Parametric Speech Synthesis

Whisper-to-speech Conversion Using Restricted Boltzmann Machine Arrays

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Joint Spectral Distribution Modeling Using Restricted Boltzmann Machines For Voice Conversion

Real-time Lip Synchronization Based on Hidden Markov Models

Speech driven photo realistic facial animation based on an articulatory DBN model and AAM features

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Learning Speaker-specific Lip-to-Speech Generation

Voice Conversion Using Conditional Restricted Boltzmann Machine

MILG: Realistic Lip-Sync Video Generation with Audio-Modulated Image Inpainting

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

Deep Restricted Boltzmann Networks

Restricted Boltzmann Machine Based Spectrum Modeling and Unit Selection Speech Synthesis Method

Talking-head Generation with Rhythmic Head Motion

Part-Based Lipreading for Audio-Visual Speech Recognition

Lip Assistant: Visualize Speech For Hearing Impaired People In Multimedia Services