Abstract: Recent neural end-to-end TTS models produce high-quality, natural-sounding speech in conventional sentence-based synthesis. Reproducing that quality for a whole paragraph remains challenging, because a paragraph-based TTS model must account for a large amount of contextual information. To ease training, we propose to model linguistic and prosodic information through the cross-sentence, embedded structure of a paragraph. Three sub-modules, a linguistics-aware network, a prosody-aware network, and a sentence-position network, are trained jointly with a modified Tacotron2. The linguistics-aware and prosody-aware networks learn the information embedded in a paragraph and the relations among its component sentences: encoders capture the paragraph-level information, and multi-head attention mechanisms learn the inter-sentence information. The sentence-position network explicitly exploits the relative position of each sentence in the paragraph. Trained on a 4.08-hour storytelling audiobook corpus recorded by a female Mandarin Chinese speaker, the proposed TTS model produces rather natural, good-quality speech paragraph-wise. Cross-sentence contextual information, such as breaks and prosodic variations between consecutive sentences, is predicted and rendered better than by the sentence-based model. On test paragraphs whose lengths are similar to, longer than, or much longer than the typical paragraph length in the training data, speech from the new model is consistently preferred over that of the sentence-based model in subjective tests, a preference confirmed by objective measures.
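To make the two cross-sentence ingredients named in the abstract concrete, the following is a minimal, illustrative sketch in plain Python. It is not the authors' implementation: the abstract does not specify layer sizes, how relative position is encoded, or how the attention is wired into Tacotron2. The function names (`relative_positions`, `attend`, `multi_head_attend`) are hypothetical, the position encoding as a value in [0, 1] is an assumption, and the learned projection matrices of real multi-head attention are omitted, so each head simply attends over a slice of the sentence vectors.

```python
import math

def relative_positions(num_sentences):
    """Assumed encoding: position of each sentence scaled to [0, 1]."""
    if num_sentences == 1:
        return [0.0]
    return [i / (num_sentences - 1) for i in range(num_sentences)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over sentence vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights

def multi_head_attend(query, keys, values, num_heads):
    """Split vectors into equal chunks, attend per chunk, concatenate.

    Simplification: no learned per-head projections are applied.
    """
    dim = len(query)
    assert dim % num_heads == 0, "dimension must divide evenly across heads"
    size = dim // num_heads
    context = []
    for h in range(num_heads):
        part = slice(h * size, (h + 1) * size)
        head_context, _ = attend(
            query[part], [k[part] for k in keys], [v[part] for v in values]
        )
        context.extend(head_context)
    return context

# Toy paragraph of three sentences, each with a made-up 4-dim embedding.
sentences = [[0.1, 0.3, -0.2, 0.5], [0.4, -0.1, 0.2, 0.0], [-0.3, 0.2, 0.1, 0.6]]
positions = relative_positions(len(sentences))
# Context for the second sentence, attending over the whole paragraph.
context = multi_head_attend(sentences[1], sentences, sentences, num_heads=2)
```

Here `context` has the same dimension as a sentence vector and mixes information from all sentences in the paragraph, while `positions` gives each sentence a scalar that a downstream network could consume alongside it.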

ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS
