Abstract: This paper presents an investigation into predicting the movement of a speaker's mouth from text input using hidden Markov models (HMMs). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train the HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-articulatory, or inversion, mapping problem, where only acoustic input is available. An average RMS error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used.
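The two figures of merit quoted in the abstract, RMS error and correlation coefficient, can be illustrated with a minimal sketch. It assumes each articulatory channel is a sequence of positions in millimetres, that the predicted and measured trajectories are already time-aligned frame by frame, and that the correlation is the Pearson coefficient. The function names and the toy data are hypothetical and not taken from the paper, which does not state here how per-channel values are averaged.

```python
import math


def rms_error(predicted, measured):
    """Root mean square error between two equal-length trajectories (same unit, e.g. mm)."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(predicted, measured):
    """Pearson correlation coefficient between two equal-length trajectories."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("trajectories must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0.0 or var_m == 0.0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_p * var_m)


# Toy, made-up trajectories for one channel (assumption: positions in mm).
measured = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [0.5, 1.0, 2.5, 2.5, 4.5]

channel_rms = rms_error(predicted, measured)
channel_corr = correlation(predicted, measured)
print(f"RMS error: {channel_rms:.3f} mm, correlation: {channel_corr:.3f}")
```

Under these assumptions, an average of the kind reported would be obtained by computing both values per channel and per test utterance and then taking the mean. A low RMS error with a low correlation would indicate trajectories that stay near the right position without following its movement, which is why the two measures are reported together.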

Minimum Generation Error Training for HMM-based Prediction of Articulatory Movements

Minimum Generation Error Training for HMM-Based Speech Synthesis

Minimum generation error training with weighted Euclidean distance on LSP for HMM-based speech synthesis

Model Adaptation for HMM-Based Speech Synthesis under Minimum Generation Error Criterion

Minimum Generation Error Training With Direct Log Spectral Distortion on LSPs for HMM-Based Speech Synthesis

An Analysis of HMM-based Prediction of Articulatory Movements

Full HMM Training for Minimizing Generation Error in Synthesis

Minimum Generation Error Training by Using Original Spectrum As Reference for Log Spectral Distortion Measure

Minimum Unit Selection Error Training for HMM-based Unit Selection Speech Synthesis System

HMM training method based on evolutionary computation and MDI in speech recognition

Training HMMs using a minimum error criterion with different loss measures

Cross-Validation and Minimum Generation Error Based Decision Tree Pruning for HMM-based Speech Synthesis

Modeling Pitch Trajectory by Hierarchical HMM with Minimum Generation Error Training

Learning Virtual HD Model for Bi-model Emotional Speaker Recognition

Cross Validation and Minimum Generation Error for Improved Model Clustering in HMM-based TTS

Preserve ordering property of generated LSPs for minimum generation error training in HMM-based speech synthesis

Target-filtering model based articulatory movement prediction for articulatory control of HMM-based speech synthesis

Discriminative Training and Explicit Duration Modeling for HMM-based Automatic Segmentation

Using Generalized Gaussian Distributions to Improve Regression Error Modeling for Deep Learning-Based Speech Enhancement

Automatic Error Detection for Unit Selection Speech Synthesis Using Log Likelihood Ratio Based SVM Classifier

Soft GPD for Minimum Classification Error Rate Training