Abstract: Deep learning-based speech synthesis has evolved by employing a sequence-to-sequence (seq2seq) structure with an attention mechanism. A seq2seq speech synthesis model pairs an encoder, which delivers the linguistic features, with a decoder, which predicts the mel-spectrogram, and learns the alignment between text and speech through the attention mechanism. The decoder predicts the mel-spectrogram through an autoregressive flow that considers the current input and what it has learned from previous inputs. This is beneficial when processing sequential data, as in speech synthesis. However, recursive generation of speech typically requires extensive training time and slows synthesis. To overcome these obstacles, we propose a non-autoregressive framework for fully parallel deep convolutional neural speech synthesis. First, we design a new synthesis paradigm that integrates a time-varying metatemplate (TVMT), whose length is modeled with a separate conditional distribution, to prepare the decoder input. The decoding step converts the TVMT into spectral features, which eliminates the autoregressive flow. Second, we propose a structure that uses multiple decoders interconnected by up-down chains with an iterative attention mechanism. The decoder chains distribute the burden of decoding, progressively infusing information obtained from the training target example into the chains to refine the predicted spectral features at each decoding step. For each decoder, the attention mechanism is applied repeatedly to produce an elaborated alignment between the linguistic features and the TVMT, which is gradually transformed into the spectral features. The proposed architecture substantially improves synthesis speed, and the resulting speech quality is superior to that of a conventional autoregressive model.
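
The contrast between the autoregressive flow and template-driven parallel decoding can be illustrated with a toy sketch. This is not the paper's model: `make_template` is a hypothetical stand-in for the TVMT, the frame count is passed in directly rather than drawn from a learned length distribution, and the single `attend` call replaces the chained decoders and iterative attention. It only shows why removing the dependence on previous output frames makes each frame independently computable.

```python
import math

def attend(query, keys):
    """Soft dot-product attention: softmax-weighted average of the key vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * key[d] for w, key in zip(weights, keys)) / total
            for d in range(len(keys[0]))]

def make_template(num_frames, dim):
    """Hypothetical TVMT stand-in: one position-dependent row per output frame."""
    return [[math.sin((t + 1) * (d + 1) / num_frames) for d in range(dim)]
            for t in range(num_frames)]

def decode_parallel(template, text_features):
    """Non-autoregressive: each frame uses only its template row and the text,
    so frames do not depend on one another and could be computed concurrently."""
    return [[x + c for x, c in zip(row, attend(row, text_features))]
            for row in template]

def decode_autoregressive(num_frames, text_features):
    """Autoregressive: each frame needs the previous frame, forcing a sequential loop."""
    prev = [0.0] * len(text_features[0])
    frames = []
    for _ in range(num_frames):
        context = attend(prev, text_features)
        prev = [0.5 * p + c for p, c in zip(prev, context)]
        frames.append(prev)
    return frames

# Toy "linguistic features" for three input tokens (illustrative values only).
text = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
template = make_template(4, 2)
parallel_frames = decode_parallel(template, text)
sequential_frames = decode_autoregressive(4, text)
```

In `decode_parallel`, reordering the template rows simply reorders the output frames, which is the property that permits full parallelism; `decode_autoregressive` has no such freedom because each step consumes the previous result.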

Bidirectional Decoding Tacotron for Attention Based Neural Speech Synthesis

DOP-Tacotron: a Fast Chinese TTS System with Local-based Attention

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

Forward-Backward Decoding for Regularizing End-to-End TTS

Improving Multi-Speaker Tacotron with Speaker Gating Mechanisms

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

Neural Speech Synthesis with Transformer Network

DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer

Improving Naturalness and Controllability of Sequence-to-Sequence Speech Synthesis by Learning Local Prosody Representations

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Close to Human Quality TTS with Transformer

Teacher-Student Training For Robust Tacotron-Based TTS

Audiovisual Speech Synthesis using Tacotron2

Speech-T: Transducer for Text to Speech and Beyond

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Multi-speaker Chinese news broadcasting system based on improved Tacotron2

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features

A Preliminary Study on Deep Learning-based Chinese Text to Taiwanese Speech Synthesis System

FastSpeech: Fast, Robust and Controllable Text to Speech

Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis