DurIAN: Duration Informed Attention Network for Speech Synthesis

Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu
DOI: https://doi.org/10.21437/interspeech.2020-2968
2020-01-01
Abstract: In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expression simultaneously. The key component of this system is the Duration Informed Attention Network (DurIAN), an autoregressive model in which the alignments between the input text and the output acoustic features are inferred from a duration model. This differs from the end-to-end attention mechanism that is used in existing end-to-end speech synthesis systems such as Tacotron and that accounts for various unavoidable artifacts in them. Furthermore, DurIAN can be used to generate high-quality facial expression that can be synchronized with the generated speech, with or without parallel speech and face data. To improve the efficiency of speech generation, we also propose a multi-band parallel generation strategy on top of the WaveRNN model. The proposed Multi-band WaveRNN effectively reduces the total computational complexity from 9.8 to 5.5 GFLOPS and is able to generate audio 6 times faster than real time on a single CPU core. We show that DurIAN can generate highly natural speech that is on par with current state-of-the-art end-to-end systems while avoiding the word skipping/repeating errors of those systems. Finally, a simple yet effective approach for fine-grained control of the expressiveness of speech and facial expression is introduced.
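To illustrate what it means for alignments to be inferred from a duration model rather than learned by attention, the following is a minimal sketch, not the authors' implementation. It assumes one common way of realizing a duration-driven alignment: each per-phoneme encoder state is repeated for the number of acoustic frames predicted for that phoneme. The function name `expand_by_duration`, the toy state vectors, and the duration values are all hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(
    states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each per-phoneme state for its predicted number of frames.

    Assumption: durations are non-negative integer frame counts, one per state.
    """
    if len(states) != len(durations):
        raise ValueError("one duration per state is required")
    frames: List[List[float]] = []
    for state, n_frames in zip(states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Every output frame copies the state of the phoneme it belongs to,
        # so the text-to-frame alignment is monotonic by construction.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Hypothetical toy input: three phoneme states and their predicted durations.
phoneme_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted_durations = [2, 1, 3]
aligned = expand_by_duration(phoneme_states, predicted_durations)
```

Because every input state is emitted exactly as many times as its duration says, no input symbol can be skipped or revisited under this scheme, which is consistent with the abstract's claim that word skipping/repeating errors are avoided.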