Abstract: Recent advances in neural end-to-end TTS models have yielded high-quality, natural synthesized speech in conventional sentence-based TTS. However, reproducing similar quality remains challenging when a whole paragraph is synthesized, because a paragraph-based TTS model must account for a large amount of contextual information. To alleviate the difficulty in training, we propose to model linguistic and prosodic information by considering the cross-sentence, embedded structure of a paragraph. Three sub-modules, namely a linguistics-aware network, a prosody-aware network, and a sentence-position network, are trained together with a modified Tacotron2. Specifically, the linguistics-aware and prosody-aware networks learn the information embedded in a paragraph and the relations among its component sentences: encoders capture the paragraph-level information, and multi-head attention mechanisms learn the inter-sentence information. The sentence-position network explicitly exploits the relative position of each sentence in the paragraph. Trained on a storytelling audio-book corpus (4.08 hours) recorded by a female Mandarin Chinese speaker, the proposed TTS model produces fairly natural, good-quality speech paragraph-wise. Cross-sentence contextual information, such as breaks and prosodic variations between consecutive sentences, is predicted and rendered better than by the sentence-based model. Tested on paragraph texts whose lengths are similar to, longer than, or much longer than the typical paragraph length of the training data, the speech produced by the new model is consistently preferred over that of the sentence-based model in subjective tests, a preference confirmed by objective measures.
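
The two cross-sentence ingredients named in the abstract, attention over sentence-level representations and a relative sentence position, can be illustrated with a toy sketch. This is not the authors' implementation: the function names are hypothetical, the [0, 1] scaling of the sentence position is an assumption the abstract does not state, and the attention here has no learned projections, whereas the real networks are trained jointly with a modified Tacotron2.

```python
import math


def relative_positions(num_sentences):
    """Relative position of each sentence in a paragraph, scaled to [0, 1].

    Assumption: linear scaling by sentence index; the abstract only says
    the relative position is exploited, not how it is encoded.
    """
    if num_sentences == 1:
        return [0.0]
    return [i / (num_sentences - 1) for i in range(num_sentences)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over sentence vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


def multi_head_attend(query, keys, values, num_heads):
    """Split vectors into equal chunks, attend per head, then concatenate.

    Simplification: heads are plain slices of the input vectors, with no
    learned query/key/value projections.
    """
    dim = len(query)
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    step = dim // num_heads
    output = []
    for head in range(num_heads):
        lo, hi = head * step, (head + 1) * step
        context, _ = attend(
            query[lo:hi],
            [k[lo:hi] for k in keys],
            [v[lo:hi] for v in values],
        )
        output.extend(context)
    return output


if __name__ == "__main__":
    # Three made-up 4-dimensional sentence vectors for one paragraph.
    sentences = [
        [0.2, 0.1, 0.9, 0.3],
        [0.8, 0.4, 0.1, 0.5],
        [0.3, 0.7, 0.6, 0.2],
    ]
    positions = relative_positions(len(sentences))
    # Context for the second sentence, gathered from the whole paragraph.
    context = multi_head_attend(sentences[1], sentences, sentences, num_heads=2)
    print(positions)
    print(context)
```

In this sketch the context vector for a sentence is a weighted mix of all sentence vectors in the paragraph, which is the sense in which attention can carry inter-sentence information; how the paper combines such a context with the position signal is not specified in the abstract.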

Improving Cross-Lingual Speech Synthesis with Triplet Training Scheme

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

An Improved Cross-Language Model Adaptation Method for Speech Synthesis

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Learning Cross-Lingual Information with Multilingual BLSTM for Speech Synthesis of Low-Resource Languages

Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS

Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

Building Multi lingual TTS using Cross Lingual Voice Conversion

HMM Based TTS for Mixed Language Text

Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT

Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS

Tri-stage training with language-specific encoder and bilingual acoustic learner for code-switching speech recognition

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

Improving Deep Neural Network Based Speech Synthesis Through Contextual Feature Parametrization and Multi-Task Learning

Using joint training speaker encoder with consistency loss to achieve cross-lingual voice conversion and expressive voice conversion

Improve Bilingual TTS Using Dynamic Language and Phonology Embedding

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations