Attention-based Recurrent Generator with Gaussian Tolerance for Statistical Parametric Speech Synthesis

Xixin Wu, Shiyin Kang, Lifa Sun, Yishuang Ning, Zhiyong Wu, Helen Meng
2017-01-01
Abstract: Conventional statistical parametric speech synthesis (SPSS) generates frame-level acoustic features in two separately optimized steps: duration prediction and acoustic feature generation. It also assumes conditional independence, generating each output frame independently given the textual input. Both factors constrain the quality of the generated speech. This work proposes applying the attention-based recurrent generator (ARG) with Gaussian Tolerance (GT) to SPSS. In ARG, duration prediction and acoustic feature generation are jointly optimized through an attention mechanism, and the dependency across output frames is modeled by conditioning acoustic feature generation on preceding frames. GT is introduced during training so that ARG becomes robust to errors in the preceding output frames on which it is conditioned. Perceptual experiments comparing the naturalness of ARG with that of the conventional hidden Markov model show a gain in mean opinion score (MOS) and confirm the effectiveness of GT.
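The abstract does not say how GT is implemented. One plausible reading is that Gaussian noise is added to the preceding frames fed to the generator during training, so that it learns to tolerate imperfect history. The sketch below illustrates only that reading. The function name `noisy_history`, the parameter `sigma`, and the all-zero initial frame are assumptions made for illustration, not details taken from the paper.

```python
import random


def noisy_history(frames, sigma, seed=0):
    """Build the 'preceding frame' inputs for teacher-forced training.

    frames: list of ground-truth acoustic frames, each a list of floats.
    sigma:  standard deviation of the added Gaussian noise
            (hypothetical hyperparameter, not specified in the abstract).
    """
    rng = random.Random(seed)
    dim = len(frames[0])
    # Step t is conditioned on frame t-1; the first step sees an all-zero
    # frame (an assumption made here for illustration).
    shifted = [[0.0] * dim] + [list(f) for f in frames[:-1]]
    # Perturb every conditioning frame so that training sees imperfect
    # history, loosely mimicking the erroneous frames fed back at synthesis.
    return [[x + rng.gauss(0.0, sigma) for x in f] for f in shifted]


if __name__ == "__main__":
    truth = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(noisy_history(truth, sigma=0.0))  # clean, shifted history
    print(noisy_history(truth, sigma=0.1))  # perturbed history
```

With `sigma=0.0` the function reduces to ordinary teacher forcing on clean preceding frames. A larger `sigma` exposes the generator to more corrupted history during training.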