Building Digital Human

Dong Yu
DOI: https://doi.org/10.1145/3394171.3425172
2020-10-12
Abstract: Digital humans find applications in areas such as virtual companions, virtual reporters, and virtual narrators. As the global trend of digitalization continues, the value of digital humans continues to increase. For example, a virtual teacher may mimic human teachers to deliver personalized education to students all over the world at a lower cost. Many technical difficulties remain to be solved before digital humans become truly valuable. In this talk, I report our recent progress on two of these difficulties: multi-modal text-to-speech synthesis and multi-modal voice separation and recognition. To address the multi-modal text-to-speech synthesis problem, we developed the duration informed attention network (DurIAN) [1]. DurIAN enhances the attention-based alignment used in state-of-the-art (SOTA) end-to-end speech synthesis systems such as Tacotron2 [2] with duration information estimated from the rich text input. This technology generates high-quality, natural speech while avoiding common pitfalls of pure end-to-end systems, such as word repetition and omission. More importantly, the system can easily align the facial representation with the synthesized speech through the duration model. To drive facial expression and mouth movement more robustly, we developed a 3D-model guided framework for multi-modal synthesis. To solve the multi-modal voice separation and recognition problem, which arises in many scenarios such as a virtual receptionist, we developed an all deep learning beamformer [3] that integrates the conventional minimum variance distortionless response (MVDR) beamformer, a recurrent neural network-based statistics estimator, and the visual cue guided speaker tracing and diarization system [4]. Our novel approach significantly improved the quality of the separated speech.
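The role of the duration model in alignment can be illustrated with a minimal sketch. This is not the DurIAN implementation, only the general idea of expanding phoneme-level units into frames using predicted durations. The function names, the phoneme symbols, and the duration values are hypothetical. Because every phoneme is emitted exactly once, for exactly its predicted number of frames, nothing can be repeated or skipped, and the resulting frame spans give a shared timeline that both the speech and the facial animation could follow.

```python
from typing import List, Tuple


def expand_by_duration(phonemes: List[str], durations: List[int]) -> List[Tuple[int, str]]:
    """Repeat each phoneme for its predicted number of frames.

    Returns a list of (frame_index, phoneme) pairs. This is a hard,
    monotonic alignment: each phoneme appears once, in order.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames: List[Tuple[int, str]] = []
    for phoneme, n_frames in zip(phonemes, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        for _ in range(n_frames):
            frames.append((len(frames), phoneme))
    return frames


def phoneme_spans(phonemes: List[str], durations: List[int]) -> List[Tuple[str, int, int]]:
    """Return (phoneme, start_frame, end_frame) with an exclusive end frame.

    The same spans could index acoustic frames and facial-parameter
    frames, assuming both streams use the same frame rate.
    """
    spans: List[Tuple[str, int, int]] = []
    start = 0
    for phoneme, n_frames in zip(phonemes, durations):
        spans.append((phoneme, start, start + n_frames))
        start += n_frames
    return spans


if __name__ == "__main__":
    # Hypothetical phoneme sequence and predicted durations (in frames).
    example_phonemes = ["HH", "AH", "L", "OW"]
    example_durations = [3, 5, 2, 6]
    for phoneme, start, end in phoneme_spans(example_phonemes, example_durations):
        print(f"{phoneme}: frames {start} to {end - 1}")
```

In a real system the durations would come from a trained duration model and the expanded sequence would condition an acoustic decoder; the sketch only shows why duration information makes the alignment explicit.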