Abstract: This paper presents an investigation into predicting the movement of a speaker's mouth from text input using hidden Markov models (HMMs). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-articulatory, or inversion, mapping problem, where only acoustic input is available. An average RMS error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used.
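The two evaluation measures quoted in the abstract, RMS error and correlation coefficient, can be illustrated with a minimal sketch. It assumes each articulatory channel (for example, one EMA coil coordinate) is a list of per-frame positions in millimetres, and that the reported averages are simple means over channels; the abstract does not state the exact averaging scheme, so that choice and the function names `rms_error`, `correlation`, and `average_over_channels` are illustrative assumptions rather than the authors' procedure.

```python
import math


def rms_error(predicted, measured):
    """RMS error between two equal-length trajectories (same units, e.g. mm)."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    return math.sqrt(squared / len(predicted))


def correlation(predicted, measured):
    """Pearson correlation coefficient between two equal-length trajectories."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length trajectories with >= 2 frames")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_p * var_m)


def average_over_channels(pred_channels, meas_channels):
    """Mean RMS error and mean correlation over channels.

    Assumption: a plain unweighted mean over channels; the source abstract
    does not specify how its averages were formed.
    """
    pairs = list(zip(pred_channels, meas_channels))
    mean_rms = sum(rms_error(p, m) for p, m in pairs) / len(pairs)
    mean_corr = sum(correlation(p, m) for p, m in pairs) / len(pairs)
    return mean_rms, mean_corr
```

The two measures are complementary: a predicted trajectory with a constant offset from the measured one has a non-zero RMS error but a correlation of 1, which is why both are reported.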

On the Evaluation of Inversion Mapping Performance in the Acoustic Domain

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

Evaluation Of Linear Regression For Speaker Adaptation In HMM-Based Articulatory Movements Estimation

Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy

Estimate Articulatory MRI Series From Acoustic Signal Using Deep Architecture

The Use of Articulatory Movement Data in Speech Synthesis Applications: an Overview — Application of Articulatory Movements Using Machine Learning Algorithms —

Kullback-Leibler Divergence Based Performance Evaluation of Geoacoustic Inversion Using Sources of Opportunity

Articulatory-WaveNet: Autoregressive Model For Acoustic-to-Articulatory Inversion

An Analysis of HMM-based Prediction of Articulatory Movements

DNN-based Acoustic-to-Articulatory Inversion using Ultrasound Tongue Imaging

A generalized smoothness criterion for acoustic-to-articulatory inversion

Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator

Study on a Joint Inversion Algorithm for Acoustic and Electromagnetic Data Based on Contrast Source Inversion Method and Cross-gradient Constraint

Unsupervised Inference of Physiologically Meaningful Articulatory Trajectories with VocalTractLab

Complete reconstruction of the tongue contour through acoustic to articulatory inversion using real-time MRI data

A deep recurrent approach for acoustic-to-articulatory inversion

Feature-Space Transform Tying in Unified Acoustic-Articulatory Modelling for Articulatory Control of HMM-Based Speech Synthesis

Speaker-independent speech inversion for recovery of velopharyngeal port constriction degree

Decoding Vocal Articulations from Acoustic Latent Representations

Integrating Articulatory Features into HMM-Based Parametric Speech Synthesis

Acoustic-to-articulatory inversion for dysarthric speech: Are pre-trained self-supervised representations favorable?