Abstract: Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherence of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system, AVTacotron2, is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes representing the sentence to synthesize into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are used to condition a WaveRNN to reconstruct the speech waveform, and the output facial controllers are used to generate the corresponding video of the talking face. The second system is modular: acoustic speech is synthesized from text using the traditional Tacotron2, and the reconstructed acoustic speech signal is then used to drive the facial controls of the face model through an independently trained audio-to-facial-animation neural network. We further condition both the end-to-end and modular approaches on emotion embeddings that encode the required prosody to generate emotional audiovisual speech. We analyze the performance of the two systems and compare them to the ground truth videos using subjective evaluation tests. The end-to-end and modular systems synthesize close to human-like audiovisual speech, with mean opinion scores (MOS) of 4.1 and 3.9, respectively, compared to a MOS of 4.1 for the ground truth generated from professionally recorded videos. While the end-to-end system gives better overall quality, the modular approach is more flexible, and the quality of its acoustic speech synthesis and its visual speech synthesis are almost independent of each other.
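The difference between the two systems is easiest to see as data flow. The sketch below is illustrative only: every function is a hypothetical stand-in that returns placeholder numbers rather than a trained network, and the feature sizes, the frames-per-phoneme ratio, and the choice to pass the emotion embedding only to the text-driven stage are assumptions made for the example, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Placeholder sizes, chosen only for illustration (assumptions).
N_ACOUSTIC = 4          # acoustic features per frame
N_FACE = 3              # face-model controllers per frame
FRAMES_PER_PHONEME = 2  # fixed ratio; a real model predicts durations


@dataclass
class AVOutput:
    acoustic: List[List[float]]  # one acoustic feature vector per frame
    face: List[List[float]]      # one face-controller vector per frame


def _decode(phonemes: Sequence[str], emotion: Sequence[float], dim: int) -> List[List[float]]:
    """Hypothetical stand-in for a Tacotron2-style decoder output."""
    n_frames = len(phonemes) * FRAMES_PER_PHONEME
    bias = sum(emotion)  # the emotion embedding conditions every frame
    return [[bias] * dim for _ in range(n_frames)]


def vocoder(acoustic: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for WaveRNN (one sample per frame, for brevity)."""
    return [sum(frame) for frame in acoustic]


def audio_to_face(waveform: Sequence[float]) -> List[List[float]]:
    """Hypothetical stand-in for the audio-to-facial-animation network."""
    return [[sample] * N_FACE for sample in waveform]


def end_to_end(phonemes: Sequence[str], emotion: Sequence[float]) -> AVOutput:
    """AVTacotron2-style: one model emits acoustic and face streams jointly."""
    return AVOutput(
        acoustic=_decode(phonemes, emotion, N_ACOUSTIC),
        face=_decode(phonemes, emotion, N_FACE),
    )


def modular(phonemes: Sequence[str], emotion: Sequence[float]) -> AVOutput:
    """Modular: text -> acoustic features -> waveform -> face controllers."""
    acoustic = _decode(phonemes, emotion, N_ACOUSTIC)
    waveform = vocoder(acoustic)
    return AVOutput(acoustic=acoustic, face=audio_to_face(waveform))


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]
    emotion = [0.1, 0.2]
    print(len(end_to_end(phonemes, emotion).face), "face frames (end-to-end)")
    print(len(modular(phonemes, emotion).face), "face frames (modular)")
```

In the modular path the face controllers depend on the text only through the waveform, which is why either stage could be swapped out without retraining the other.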

Visual Speech Synthesis Algorithm Based on Chinese Visual Triphone

Text-To-Visual Speech in Chinese Based on Data-Driven Approach

Text-driven Visual Prosody Generation for Embodied Conversational Agents

Automatic Selecting Algorithm of Bimodal Corpus Based on Visual Triphone

Visual Speech Synthesis Based on Articulatory Trajectory

Construction of Chinese Bimodal Corpus Based on Visual Triphone

Lip Movement Synthesis in Audio-Visual Speech Recognition System

Realistic Visual Speech Synthesis Based on Hybrid Concatenation Method

Accurate Visual Speech Synthesis Based on Diviseme Unit Selection and Concatenation

Real-time Synthesis of Chinese Visual Speech and Facial Expressions Using MPEG-4 FAP Features in a Three-Dimensional Avatar

Learning based visual speech synthesis system

Audio-driven Talking Face Video Generation with Natural Head Pose

Speech Driven Facial Animation Using Chinese Mandarin Pronunciation Rules

Bi-level Codebook Based Speech-driven Visual-speech Synthesis System

An Emotional Text-Driven 3D Visual Pronunciation System for Mandarin Chinese

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Improving Cross-Lingual Speech Synthesis with Triplet Training Scheme

Visualize Speech: a Continuous Speech Recognition System for Facial Animation Using Acoustic Visemes

Audiovisual Speech Synthesis using Tacotron2

Visual Speech Synthesis Based on Learning Model