Abstract:
Background: Virtual humans have become part of everyday life (movies, internet, and computer games). Although they are increasingly realistic, their speech capabilities are usually limited: the facial articulation is often incoherent and/or not synchronous with the corresponding acoustic signal.
Methods: We describe a method to convert a virtual human avatar (animated through key frames and interpolation) into a more naturalistic talking head. Speech articulation cannot be accurately replicated by interpolating between key frames, so talking heads with good speech capabilities are derived from real speech production data. Motion capture data are commonly used to provide accurate facial motion for the visible speech articulators (jaw and lips), synchronous with the acoustics. To access the trajectories of the tongue (a partially occluded speech articulator), electromagnetic articulography (EMA) is often used. We recorded a large database of phonetically balanced English sentences with synchronous EMA, motion capture data, and acoustics. An articulatory model was computed on this database to recover missing data and to provide 'normalized' animation (i.e., articulatory) parameters. In addition, semi-automatic segmentation was performed on the acoustic stream. A dictionary of multimodal Australian English diphones was then created; it stores the variation of the articulatory parameters between each pair of successive stable allophones.
Results: The avatar's facial key frames were converted into articulatory parameters steering its speech articulators (jaw, lips, and tongue). The speech production database was used to drive the Embodied Conversational Agent (ECA) and to enhance its speech capabilities. A Text-To-Auditory Visual Speech synthesizer was created based on the MaryTTS software and on the diphone dictionary derived from the speech production database.
Conclusions: We describe a method to transform an ECA with a generic tongue model and key-frame animation into a talking head that displays naturalistic tongue, jaw, and lip motions. Using a multimodal speech production database, a Text-To-Auditory Visual Speech synthesizer drives the ECA's facial movements, enhancing its speech capabilities.
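
As an illustration of the concatenation idea behind a diphone dictionary, the following is a minimal sketch and not the authors' implementation. It assumes that each dictionary entry holds a (frames x parameters) array of articulatory parameters running from the stable part of one allophone to the stable part of the next. The function names `resample` and `concatenate_diphones`, the linear time resampling, and the averaging of the two frames that meet at each join are all assumptions made for this example.

```python
import numpy as np


def resample(traj, n_frames):
    """Linearly resample a (T, P) trajectory to n_frames frames (assumed scheme)."""
    traj = np.asarray(traj, dtype=float)
    src = np.linspace(0.0, 1.0, len(traj))
    dst = np.linspace(0.0, 1.0, n_frames)
    # Interpolate each articulatory parameter independently over time.
    columns = [np.interp(dst, src, traj[:, p]) for p in range(traj.shape[1])]
    return np.stack(columns, axis=1)


def concatenate_diphones(phones, dictionary, durations):
    """Chain diphone trajectories for a phone sequence.

    phones:     phone labels, e.g. ["_", "a", "_"] (hypothetical labels)
    dictionary: {(left, right): (T, P) array of articulatory parameters}
    durations:  frames allotted to each diphone, len(phones) - 1 entries
    """
    pairs = zip(phones[:-1], phones[1:])
    units = [resample(dictionary[pair], n) for pair, n in zip(pairs, durations)]
    out = [units[0]]
    for unit in units[1:]:
        # Both boundary frames describe the same stable allophone:
        # merge them into a single averaged frame to avoid a jump.
        joint = 0.5 * (out[-1][-1] + unit[0])
        out[-1] = out[-1][:-1]
        out.append(np.vstack([joint, unit[1:]]))
    return np.vstack(out)


# Toy dictionary with two articulatory parameters (values are made up).
demo_dictionary = {
    ("_", "a"): np.array([[0.0, 0.0], [1.0, 2.0]]),
    ("a", "_"): np.array([[1.0, 4.0], [0.0, 0.0]]),
}
trajectory = concatenate_diphones(["_", "a", "_"], demo_dictionary, [3, 3])
```

With this joining rule, the output has one frame fewer per join than the sum of the requested durations. A real system would take the durations from the segmentation produced by the speech synthesizer and might use a smoother joining scheme.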

Cospeech body motion generation using a transformer

Text-driven Visual Prosody Generation for Embodied Conversational Agents

BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer

SpeechAct: Towards Generating Whole-body Motion from Speech

Transformer Network for Semantically-Aware and Speech-Driven Upper-Face Generation

Generating Holistic 3D Human Motion from Speech

Freeform Body Motion Generation from Speech

JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Language Model

Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model

Transformer-S2A: Robust and Efficient Speech-to-Animation

Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents

MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs

Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis

Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions

Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Transformer Based Multi-model Fusion for 3D Facial Animation

Towards Realistic 3D Human Motion Prediction with A Spatio-temporal Cross-transformer Approach

A Two-part Transformer Network for Controllable Motion Synthesis