Abstract: Digital humans have applications in areas such as virtual companions, virtual reporters, and virtual narrators. As the global trend toward digitalization continues, the value of digital humans keeps increasing. For example, a virtual teacher may mimic human teachers to deliver personalized education to students all over the world at a lower cost. Many technical difficulties must still be solved to make digital humans truly valuable. In this talk, I report our recent progress on two of these difficulties: multi-modal text-to-speech synthesis, and multi-modal voice separation and recognition. To address the multi-modal text-to-speech synthesis problem, we developed the duration informed attention network (DurIAN) [1]. DurIAN enhances the attention-based alignment used in state-of-the-art (SOTA) end-to-end speech synthesis systems such as Tacotron2 [2] with duration information estimated from the rich text input. This technology generates high-quality natural speech while avoiding common pitfalls of purely end-to-end systems, such as repeated and skipped words. More importantly, the system can easily align the facial representation with the synthesized speech through the duration model. To drive facial expression and mouth movement more robustly, we developed a 3D-model guided framework for multi-modal synthesis. To solve the multi-modal voice separation and recognition problem, which arises in many scenarios such as a virtual receptionist, we developed an all deep learning beamformer [3] that integrates the conventional minimum variance distortionless response (MVDR) beamformer, a recurrent neural network-based statistics estimator, and a visual cue guided speaker tracing and diarization system [4]. Our novel approach significantly improved the quality of the separated speech.
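The role of duration information in alignment can be illustrated with a small sketch. It is not the DurIAN implementation; it only assumes that each phoneme has an encoding vector and a duration in frames, and that alignment amounts to repeating each encoding for its duration. The function names `expand_by_duration` and `frame_to_phoneme`, the toy vectors, and the durations are all hypothetical.

```python
import numpy as np


def expand_by_duration(phoneme_states, durations):
    """Repeat each phoneme encoding for its duration in frames.

    Assumption for illustration: alignment is a plain repetition of
    per-phoneme vectors, with durations given as whole frame counts.
    """
    phoneme_states = np.asarray(phoneme_states)
    durations = np.asarray(durations, dtype=int)
    if phoneme_states.shape[0] != durations.shape[0]:
        raise ValueError("one duration per phoneme is required")
    if np.any(durations < 0):
        raise ValueError("durations must be non-negative")
    return np.repeat(phoneme_states, durations, axis=0)


def frame_to_phoneme(durations):
    """Return, for every output frame, the index of its source phoneme."""
    durations = np.asarray(durations, dtype=int)
    return np.repeat(np.arange(len(durations)), durations)


# Toy input: three phonemes with 2-dimensional encodings (made-up values).
states = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
durations = [2, 1, 3]  # frames per phoneme, hypothetical

frames = expand_by_duration(states, durations)
index = frame_to_phoneme(durations)

print(frames.shape)
print(index.tolist())
```

Because every frame maps back to exactly one phoneme, no phoneme with a positive duration can be skipped or visited twice, and the same frame index could in principle be shared by an acoustic stream and a facial stream to keep the two in step.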

DurIAN: Duration Informed Attention Network for Speech Synthesis

DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis

DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis

DIAN: Duration Informed Auto-Regressive Network for Voice Cloning

AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis

Synthesising Expressiveness in Peking Opera via Duration Informed Attention Network

Using Speech Enhancement to Realize Speech Synthesis of Low-Resource Dungan Languages

Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks

Building Digital Human

Audio-driven talking face generation with diverse yet realistic facial animations

Neural Speech Synthesis with Transformer Network

DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer

Focusing on attention: prosody transfer and adaptative optimization strategy for multi-speaker end-to-end speech synthesis

Video-driven speaker-listener generation based on Transformer and neural renderer

NeRF-AD: Neural Radiance Field with Attention-based Disentanglement for Talking Face Synthesis

An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023

Speech-driven Facial Animation with Spectral Gathering and Temporal Attention

Phoneme Dependent Speaker Embedding And Model Factorization For Multi-Speaker Speech Synthesis And Adaptation

Speech Enhancement with Intelligent Neural Homomorphic Synthesis

Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

Multi-Metric Optimization using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement