Abstract:
Background: Virtual humans have become part of our everyday life (movies, internet, and computer games). Even though they are becoming more and more realistic, their speech capabilities are usually limited, and their visible articulation is often incoherent or not synchronous with the corresponding acoustic signal.
Methods: We describe a method to convert a virtual human avatar (animated through key frames and interpolation) into a more naturalistic talking head. Speech articulation cannot be accurately replicated by interpolating between key frames; talking heads with good speech capabilities are instead derived from real speech production data. Motion capture data are commonly used to provide accurate facial motion for the visible speech articulators (jaw and lips), synchronous with the acoustics. To capture trajectories of the tongue, a partially occluded speech articulator, electromagnetic articulography (EMA) is often used. We recorded a large database of phonetically balanced English sentences with synchronous EMA, motion capture data, and acoustics. An articulatory model was computed on this database to recover missing data and to provide 'normalized' animation (i.e., articulatory) parameters. In addition, semi-automatic segmentation was performed on the acoustic stream. A dictionary of multimodal Australian English diphones was then created. It contains the variation of the articulatory parameters between each pair of successive stable allophones.
Results: The avatar's facial key frames were converted into articulatory parameters steering its speech articulators (jaw, lips, and tongue). The speech production database was used to drive the Embodied Conversational Agent (ECA) and to enhance its speech capabilities. A Text-To-Auditory Visual Speech synthesizer was created based on the MaryTTS software and on the diphone dictionary derived from the speech production database.
Conclusions: We describe a method to transform an ECA with a generic tongue model and key-frame animation into a talking head that displays naturalistic tongue, jaw, and lip motions. Thanks to a multimodal speech production database, a Text-To-Auditory Visual Speech synthesizer drives the ECA's facial movements, enhancing its speech capabilities.
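
As an illustration of how a diphone dictionary of articulatory parameters could drive concatenation synthesis, the following is a minimal sketch only, not the authors' implementation. It assumes each dictionary entry is a (frames x parameters) array keyed by a (left phone, right phone) pair; the function names, the linear time-stretching, and the fading correction at the joins are all hypothetical choices made for this example.

```python
import numpy as np


def resample(traj, n_frames):
    """Linearly time-stretch a (T, P) trajectory to n_frames frames."""
    traj = np.asarray(traj, dtype=float)
    src = np.linspace(0.0, 1.0, len(traj))
    dst = np.linspace(0.0, 1.0, n_frames)
    return np.column_stack(
        [np.interp(dst, src, traj[:, p]) for p in range(traj.shape[1])]
    )


def concatenate_diphones(phones, durations, dictionary):
    """Chain diphone trajectories for a phone sequence.

    phones     : phone labels, e.g. ["_", "a", "_"] (hypothetical labels)
    durations  : target length in frames for each diphone (each >= 2)
    dictionary : maps (left, right) to a (T, P) array of articulatory parameters
    """
    pieces = []
    last = None  # final frame of the trajectory built so far
    for pair, n in zip(zip(phones[:-1], phones[1:]), durations):
        if n < 2:
            raise ValueError("each diphone needs at least 2 frames")
        unit = resample(dictionary[pair], n)
        if last is not None:
            # Remove the jump at the join: apply the full mismatch at the
            # first frame and fade it linearly to zero at the last frame.
            fade = np.linspace(1.0, 0.0, n)[:, None]
            unit = unit + (last - unit[0]) * fade
            unit = unit[1:]  # the boundary frame is already in the output
        pieces.append(unit)
        last = unit[-1]
    return np.vstack(pieces)


# Toy dictionary with two articulatory parameters (values are made up).
toy_dictionary = {
    ("_", "a"): [[0.0, 0.0], [1.0, 2.0]],
    ("a", "_"): [[1.2, 2.0], [0.0, 0.0]],
}
traj = concatenate_diphones(["_", "a", "_"], [3, 3], toy_dictionary)
print(traj.shape)
```

In this sketch the frame shared by two adjacent diphones is emitted only once, so the output has one frame fewer per join than the sum of the requested durations. A real system would take the durations from the acoustic segmentation so that the articulation stays synchronous with the audio.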

Speech synthesis of VCV sequence using a physiological articulatory model

Significant enhancement of room temperature ferromagnetism in surfactant coated polycrystalline Mn doped ZnO particles

An articulatory model of standard Chinese using MRI and X-ray movie

A Speech-Driven 3-D Tongue Model with Realistic Movement in Mandarin Chinese.

Acoustic VR in the Mouth: A Real-Time Speech-Driven Visual Tongue System.

Deep Speech Synthesis from MRI-Based Articulatory Representations

Vowel Creation by Articulatory Control in HMM-based Parametric Speech Synthesis

Coding Speech through Vocal Tract Kinematics

Laryngeal Muscular Control of Vocal Fold Posturing: Numerical Modeling and Experimental Validation.

A 3D biomechanical vocal tract model to study speech production control: How to take into account the gravity?

Ultraviolet irradiation of murine skin alters cluster formation between lymph node dendritic cells and specific T lymphocytes.

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

RESTRICTION ENZYME ANALYSIS AND HERPES SIMPLEX INFECTIONS

Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders

Radius Vector-Driven 3-D Mandarin Vocal Tract Model

Unsupervised Inference of Physiologically Meaningful Articulatory Trajectories with VocalTractLab

A 3D dynamical biomechanical tongue model to study speech motor control

Progress in animation of an EMA-controlled tongue model for acoustic-visual speech synthesis

A Multilinear Tongue Model Derived from Speech Related MRI Data of the Human Vocal Tract

Feature-Space Transform Tying in Unified Acoustic-Articulatory Modelling for Articulatory Control of HMM-Based Speech Synthesis.

Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis