Abstract: Recent neural end-to-end TTS models produce high-quality, natural speech in conventional sentence-based TTS. Reproducing similar quality for a whole paragraph remains challenging, because a paragraph-based TTS model must account for a large amount of contextual information. To ease training, we propose modeling linguistic and prosodic information through the cross-sentence, embedded structure of a paragraph. Three sub-modules, a linguistics-aware network, a prosody-aware network, and a sentence-position network, are trained jointly with a modified Tacotron2. The linguistics-aware and prosody-aware networks learn the information embedded in a paragraph and the relations among its component sentences: encoders capture the paragraph-level information, and multi-head attention mechanisms learn the inter-sentence information. The sentence-position network explicitly exploits the relative position of each sentence in the paragraph. Trained on a 4.08-hour storytelling audiobook corpus recorded by a female Mandarin Chinese speaker, the proposed TTS model produces rather natural, good-quality speech paragraph-wise. Cross-sentence contextual information, such as breaks and prosodic variations between consecutive sentences, is predicted and rendered better than by the sentence-based model. Tested on paragraph texts whose lengths are similar to, longer than, or much longer than the typical paragraph length in the training data, speech from the new model is consistently preferred over that of the sentence-based model in subjective tests, and objective measures confirm this preference.
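As a rough illustration of the inter-sentence mechanism the abstract describes, the sketch below applies multi-head self-attention across per-sentence embeddings of one paragraph and computes a simple relative sentence-position feature. It is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function names, the random projection matrices (standing in for learned weights), the embedding size, and the linear 0-to-1 position encoding are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(sent_emb, num_heads, rng):
    """Attend across the sentences of one paragraph.

    sent_emb: array of shape (num_sentences, d_model), one vector per sentence.
    Returns (context, weights) with shapes (num_sentences, d_model) and
    (num_heads, num_sentences, num_sentences).
    """
    n, d_model = sent_emb.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    # Assumption: random matrices stand in for learned projection weights.
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )

    def split_heads(x):
        # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(sent_emb @ w_q)
    k = split_heads(sent_emb @ w_k)
    v = split_heads(sent_emb @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v

    # Concatenate the heads and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(n, d_model)
    return merged @ w_o, weights


def relative_positions(num_sentences):
    """Hypothetical position feature: 0.0 for the first sentence, 1.0 for the last."""
    if num_sentences == 1:
        return np.zeros(1)
    return np.arange(num_sentences) / (num_sentences - 1)


rng = np.random.default_rng(0)
sent_emb = rng.standard_normal((5, 16))  # 5 sentences, 16-dim embeddings
context, weights = multi_head_self_attention(sent_emb, num_heads=4, rng=rng)
positions = relative_positions(5)
```

Each row of `weights[h]` shows how strongly one sentence attends to every other sentence in head `h`. In a trained model, the position feature could be concatenated to each sentence's context vector before decoding, though how the paper combines them is not stated in the abstract.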

ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS
