Abstract:
Background: Virtual humans have become part of our everyday life (movies, internet, and computer games). Even though they are becoming more and more realistic, their speech capabilities are usually limited, and their articulation is often incoherent or not synchronous with the corresponding acoustic signal.
Methods: We describe a method to convert a virtual human avatar (animated through key frames and interpolation) into a more naturalistic talking head. Speech articulation cannot be accurately replicated by interpolating between key frames; talking heads with good speech capabilities are instead derived from real speech production data. Motion capture data are commonly used to provide accurate facial motion for the visible speech articulators (jaw and lips), synchronous with the acoustics. To capture the trajectories of the tongue, a partially occluded speech articulator, electromagnetic articulography (EMA) is often used. We recorded a large database of phonetically balanced English sentences with synchronous EMA, motion capture data, and acoustics. An articulatory model was computed on this database to recover missing data and to provide 'normalized' animation (i.e., articulatory) parameters. In addition, semi-automatic segmentation was performed on the acoustic stream. A dictionary of multimodal Australian English diphones was then created. It stores the variation of the articulatory parameters between all successive stable allophones.
Results: The avatar's facial key frames were converted into articulatory parameters steering its speech articulators (jaw, lips, and tongue). The speech production database was used to drive the Embodied Conversational Agent (ECA) and to enhance its speech capabilities. A Text-To-Auditory Visual Speech synthesizer was created based on the MaryTTS software and on the diphone dictionary derived from the speech production database.
Conclusions: We describe a method to transform an ECA with a generic tongue model and key-frame animation into a talking head that displays naturalistic tongue, jaw, and lip motions. Thanks to a multimodal speech production database, a Text-To-Auditory Visual Speech synthesizer drives the ECA's facial movements, enhancing its speech capabilities.
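
The idea of a diphone dictionary that stores articulatory parameter trajectories between successive stable allophones can be illustrated with a minimal sketch. This is not the authors' implementation: the diphone naming scheme (`"h-@"`), the two toy parameters, the values in `TOY_DICTIONARY`, and the choice to average the shared frame at each unit boundary are all assumptions made here for illustration.

```python
from typing import Dict, List, Sequence

Frame = List[float]        # one vector of articulatory parameters (hypothetical)
Trajectory = List[Frame]   # frames running from one stable allophone to the next


def phones_to_diphones(phones: Sequence[str]) -> List[str]:
    """Turn a phone sequence into the names of its successive diphones."""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]


def concatenate(units: Sequence[Trajectory]) -> Trajectory:
    """Join diphone trajectories, averaging the frame shared at each boundary."""
    out: Trajectory = []
    for unit in units:
        if not unit:
            continue
        if out:
            # Assumption: the last frame of the previous unit and the first
            # frame of this one describe the same stable allophone, so they
            # are blended into a single frame.
            out[-1] = [(p + q) / 2.0 for p, q in zip(out[-1], unit[0])]
            out.extend(list(frame) for frame in unit[1:])
        else:
            out.extend(list(frame) for frame in unit)
    return out


def synthesize(phones: Sequence[str],
               dictionary: Dict[str, Trajectory]) -> Trajectory:
    """Look up each diphone and concatenate the stored trajectories."""
    names = phones_to_diphones(phones)
    missing = [name for name in names if name not in dictionary]
    if missing:
        raise KeyError(f"diphones not in dictionary: {missing}")
    return concatenate([dictionary[name] for name in names])


# Toy dictionary with two made-up parameters per frame
# (for example jaw opening and lip rounding); "_" stands for silence.
TOY_DICTIONARY: Dict[str, Trajectory] = {
    "_-h": [[0.0, 0.0], [0.2, 0.0]],
    "h-@": [[0.2, 0.1], [0.5, 0.1], [0.6, 0.2]],
    "@-_": [[0.6, 0.2], [0.0, 0.0]],
}

trajectory = synthesize(["_", "h", "@", "_"], TOY_DICTIONARY)
for frame in trajectory:
    print(frame)
```

In a real system the phone sequence and durations would come from the text-to-speech front end (MaryTTS in the work described above), and each stored unit would be time-aligned to the synthesized audio; the sketch leaves both steps out.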

Phoneme Embedding and its Application to Speech Driven Talking Avatar Synthesis

Enhanced Double-Carrier Word Embedding Via Phonetics and Writing

Expressive Speech Driven Talking Avatar Synthesis with DBLSTM Using Limited Amount of Emotional Bimodal Data

Low Level Descriptors Based DBLSTM Bottleneck Feature for Speech Driven Talking Avatar

Introducing Phonetic Information to Speaker Embedding for Speaker Verification

The Parameterized Phoneme Identity Feature As a Continuous Real-Valued Vector for Neural Network Based Speech Synthesis

Multilingual and Crosslingual Speech Recognition Using Phonological-Vector Based Phone Embeddings

Phoneme Dependent Speaker Embedding And Model Factorization For Multi-Speaker Speech Synthesis And Adaptation

A real-time speech driven talking avatar based on deep neural network

Acoustic to Articulatory Mapping with Deep Neural Network

Improve Bilingual TTS Using Dynamic Language and Phonology Embedding

Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

Phonetic-aware speaker embedding for far-field speaker verification

Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers

Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval

Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis

Sem-Avatar: Semantic Controlled Neural Field for High-Fidelity Audio Driven Avatar

Towards Streaming Speech-to-Avatar Synthesis

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior