Transformer Speech Synthesis for Tibetan Lhasa Dialect Based on Multilevel Granular Unit

Xiaona Xu, Ning Li, Yue Zhao
DOI: https://doi.org/10.1109/PRAI59366.2023.10331979
2023-01-01
Abstract: The Tacotron model has achieved good results in end-to-end Tibetan speech synthesis. However, because it is based on a recurrent neural network (RNN), it suffers from low training and prediction efficiency and from the loss of long-range information. To further improve Tibetan speech synthesis, an end-to-end speech synthesis model based on the Transformer is proposed to realize speech synthesis for multiple dialects of Tibetan. In the model, the hidden states of the encoder and decoder are constructed in parallel using the multi-head attention mechanism, which effectively addresses the problem of modeling long-distance dependencies and allows the model to take advantage of multi-GPU parallel training. Three granular units (Tibetan characters, Latin letters, and Tibetan components) are used as input to the acoustic model in order to select the unit that gives the best synthesis result. The Transformer TTS network transforms text sequences into Mel spectrograms, and the WaveNet vocoder converts the Mel spectrograms into the final speech waveforms. The study conducts a series of comparative experiments. First, the three granular units are compared under the proposed method; then single-GPU training is compared with multi-GPU parallel training. In addition, Tacotron and the Transformer are compared on Tibetan Lhasa dialect speech synthesis. The experimental results show that the Transformer-based end-to-end model performs better than Tacotron on Tibetan Lhasa dialect speech synthesis, and that speech synthesized with Latin letters as the synthesis unit and with multi-GPU parallel training has better clarity and naturalness.
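The abstract says the encoder and decoder hidden states are built in parallel with multi-head attention. The sketch below illustrates that general mechanism (standard scaled dot-product self-attention) in NumPy; it is not the authors' implementation. The function names, dimensions, and random weights are assumptions chosen for illustration, and a real TTS model would add positional encodings, masking, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Generic multi-head self-attention over one sequence.

    x: (seq_len, d_model) input embeddings (hypothetical text-unit embeddings).
    w_q, w_k, w_v, w_o: (d_model, d_model) projection matrices.
    Returns the output (seq_len, d_model) and the attention weights
    (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Every position attends to every other position in one matrix product,
    # so no step-by-step recurrence over the sequence is needed.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    context = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

# Illustrative sizes only; these are not taken from the paper.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Because the attention scores for all positions come from a single batched matrix product, the computation has no sequential dependency along the sequence, which is the property that distinguishes it from an RNN-based model.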