Abstract: This paper presents an investigation into predicting the movement of a speaker's mouth from text input using hidden Markov models (HMMs). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train the HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-articulatory, or inversion, mapping problem, where only acoustic input is available. In this setting, an average RMS error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used.
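
The two figures of merit quoted above, RMS error and correlation coefficient, can be illustrated with a minimal sketch. It assumes each articulatory channel is a plain sequence of positions in millimetres, with predicted and measured values aligned frame by frame. The function names `rms_error` and `correlation` are hypothetical, and the abstract does not say how values are averaged across channels or utterances, so that step is left out.

```python
import math


def rms_error(predicted, measured):
    """RMS error between two aligned trajectories (same unit, e.g. mm)."""
    if len(predicted) != len(measured) or len(predicted) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(predicted, measured):
    """Pearson correlation coefficient between two aligned trajectories."""
    if len(predicted) != len(measured) or len(predicted) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Covariance and variances are left unnormalised; the 1/n factors cancel.
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_p * var_m)
```

A lower RMS error means the predicted positions are closer to the EMA measurements, while a higher correlation means the predicted trajectory follows the shape of the measured one, independently of offset and scale.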

Evaluation of Linear Regression for Speaker Adaptation in HMM-Based Articulatory Movements Estimation

Agmma: A Novel Incremental Adaptation Method and Its Application to Speaker Recognition

MAP-based Speaker Adaptation in Speech Synthesis

Estimate Articulatory MRI Series from Acoustic Signal Using Deep Architecture

On the Evaluation of Inversion Mapping Performance in the Acoustic Domain

Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy

Feature-Space Transform Tying in Unified Acoustic-Articulatory Modelling for Articulatory Control of HMM-Based Speech Synthesis

Latent Correlation Analysis of HMM Parameters for Speech Recognition

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

A New Subspace Based Speaker Adaptation Method

An Analysis of HMM-based Prediction of Articulatory Movements

Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator

A Speaker Adaptation Algorithm Based on Matrix Linear Interpolation

Integrating Articulatory Features into HMM-Based Parametric Speech Synthesis

Label Transform Based Cross-Language Speaker Adaptation in Bilingual (Mandarin-English) TTS

Speech Recognition Using Speaker Adaptation by System Parameter Transformation

Articulatory-to-acoustic Conversion Using BLSTM-RNNs with Augmented Input Representation

Rapid Speaker Adaptation Using Multi-Stream Structural Maximum Likelihood Eigenspace Mapping

Speaker adaptation using maximum likelihood model interpolation

Improving the Performance of HMM-based Voice Conversion Using Context Clustering Decision Tree and Appropriate Regression Matrix Format

Rapid discriminative acoustic model based on eigenspace mapping for fast speaker adaptation