Bidirectional Decoding Tacotron for Attention Based Neural Speech Synthesis

Wei Zhao, Li Xu
DOI: https://doi.org/10.1145/3561877.3561887
2022-01-01
Abstract: Attention-based neural text-to-speech (TTS) has become increasingly popular because of its end-to-end network architecture and its output quality, which is comparable to human recordings. However, existing approaches usually adopt a unidirectional decoding framework that generates the target spectrum from left to right and therefore cannot take advantage of reverse target-side context running from right to left. To address this problem, we present a bidirectional decoding speech synthesis network based on the well-known Tacotron2. In particular, our model first employs a backward decoder to predict the spectrum from right to left, conditioned on the output states of the text encoder. The forward decoder then generates the spectrum from left to right, attending to both the encoder outputs and the context hidden states of the backward decoder. With this architecture, our bidirectional decoding Tacotron2 can exploit both backward and forward information to improve performance. Objective and subjective evaluations on LJSpeech demonstrate the effectiveness of the proposed method.
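The two-pass data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes plain dot-product attention and a single tanh recurrent layer in place of Tacotron2's location-sensitive attention, LSTM decoder, prenet, and postnet, and it uses random untrained weights. All names (`attend`, `run_decoder`, `bidirectional_decode`, `W_in`, `W_out`) are hypothetical, and the encoder outputs and decoder hidden states are assumed to share one dimension `d` so that both can serve as attention memories.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def attend(query, memory):
    """Dot-product attention: softmax-weighted sum of the rows of `memory`."""
    scores = matvec(memory, query)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(memory[0])
    return [sum(e / total * row[i] for e, row in zip(exps, memory))
            for i in range(dim)]


def init_params(rng, d, d_mel, n_memories):
    """Random (untrained) weights for one toy decoder."""
    def mat(rows, cols):
        return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W_in": mat(d, d + d_mel + d * n_memories), "W_out": mat(d_mel, d)}


def run_decoder(n_frames, memories, params):
    """Autoregressive toy decoder that attends to every memory in `memories`."""
    d = len(params["W_in"])
    d_mel = len(params["W_out"])
    hidden, frame = [0.0] * d, [0.0] * d_mel  # zero state and all-zero start frame
    frames, states = [], []
    for _ in range(n_frames):
        # One context vector per memory, concatenated.
        context = [c for m in memories for c in attend(hidden, m)]
        hidden = [math.tanh(v)
                  for v in matvec(params["W_in"], hidden + frame + context)]
        frame = matvec(params["W_out"], hidden)
        frames.append(frame)
        states.append(hidden)
    return frames, states


def bidirectional_decode(encoder_outputs, n_frames, bwd_params, fwd_params):
    # Pass 1: the backward decoder emits frames right to left,
    # conditioned only on the text encoder outputs.
    bwd_frames, bwd_states = run_decoder(n_frames, [encoder_outputs], bwd_params)
    # Flip both sequences into left-to-right time order.
    bwd_frames, bwd_states = bwd_frames[::-1], bwd_states[::-1]
    # Pass 2: the forward decoder attends to the encoder outputs
    # and to the hidden states of the backward decoder.
    fwd_frames, _ = run_decoder(n_frames, [encoder_outputs, bwd_states], fwd_params)
    return fwd_frames, bwd_frames
```

In this sketch the forward decoder simply receives a second attention memory. How the two contexts are combined in the actual model, and how the two decoders are trained jointly, is not specified in the abstract.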