Enhancing Monotonicity for Robust Autoregressive Transformer TTS

Xiangyu Liang, Zhiyong Wu, Runnan Li, Yanqing Liu, Sheng Zhao, Helen Meng
DOI: https://doi.org/10.21437/interspeech.2020-1751
2020-01-01
Abstract: With the development of sequence-to-sequence modeling algorithms, Text-to-Speech (TTS) techniques have achieved significant improvements in speech quality and naturalness. Deep learning models such as recurrent neural networks (RNNs) and their memory-enhanced variants have shown a strong ability to reconstruct acoustic features from input linguistic features. However, their efficiency is limited by sequential processing in both training and inference. Recently, the Transformer, with its advantage in parallelism, has been applied to TTS. It uses positional embeddings instead of a recurrent mechanism for position modeling and significantly boosts training speed. However, this approach lacks a monotonic constraint and suffers from issues such as pronunciation skipping. Therefore, in this paper, we propose a monotonicity-enhancing approach that combines Stepwise Monotonic Attention (SMA) with multi-head attention for Transformer-based TTS systems. Experiments show that the proposed approach reduces bad cases from 53 of 500 sentences to 1, while improving MOS from 4.09 to 4.17 in the naturalness test.
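To make the monotonic constraint concrete, the following is a minimal sketch of one decoder step of a stepwise monotonic alignment update. It assumes the commonly cited SMA recurrence, in which the attention either stays on the current encoder position or advances by exactly one. The abstract does not spell out the authors' formulation or how it is combined with multi-head attention, so the function name `stepwise_monotonic_step` and its parameters are hypothetical and for illustration only.

```python
def stepwise_monotonic_step(prev_alpha, stay_prob):
    """One decoder step of an expected stepwise-monotonic alignment (sketch).

    prev_alpha: attention weights over encoder positions at the previous step.
    stay_prob:  per-position probability (0..1) of staying on that position;
                in a real model this would come from a sigmoid over attention
                energies (an assumption, not taken from the abstract).
    """
    if len(prev_alpha) != len(stay_prob):
        raise ValueError("prev_alpha and stay_prob must have the same length")

    new_alpha = []
    for j in range(len(prev_alpha)):
        # Mass that was already on position j and stays there.
        stay = prev_alpha[j] * stay_prob[j]
        # Mass that was on position j-1 and moves forward by exactly one.
        move = prev_alpha[j - 1] * (1.0 - stay_prob[j - 1]) if j > 0 else 0.0
        new_alpha.append(stay + move)
    # Note: mass that would move past the final position is dropped here.
    return new_alpha


# Start fully aligned to the first encoder position.
alpha = [1.0, 0.0, 0.0]
alpha = stepwise_monotonic_step(alpha, [0.5, 0.5, 0.5])
print(alpha)  # [0.5, 0.5, 0.0]
```

Because weight can only stay or advance by one position per step, the third position receives no weight after a single step. This is the property that rules out skipping an input token, under the assumptions stated above.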