Abstract: This paper presents a speech BERT model that extracts the prosody information embedded in speech segments to improve the prosody of synthesized speech in neural text-to-speech (TTS). As a pre-trained model, it can learn prosody attributes from a large amount of speech data, more than the training data used by the target TTS. The proposed BERT extracts an embedding from the preceding segment of a fixed length. The TTS decoder then uses this embedding together with the mel-spectrogram to predict the following segment. Experimental results obtained with the Transformer TTS show that the proposed BERT can extract fine-grained, segment-level prosody, which complements utterance-level prosody and improves the final prosody of the TTS speech. On a single-speaker TTS, the objective distortions between the generated speech and the original recordings are reduced. Subjective listening tests also show that the proposed approach is preferred over the TTS without the BERT prosody embedding module, for both in-domain and out-of-domain applications. The subjective preference for the new BERT prosody embedding is likewise confirmed for Microsoft professional single/multiple speakers and for the LJ Speaker in the public database. TTS demo audio samples are available at https://judy44chen.github.io/TTSSpeechBERT/.

Prosodic Structure Prediction Using Deep Self-attention Neural Network

Time Domain Speech Enhancement Using Self-Attention-Based Subspace Projection

Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features

Automatic Prosodic Structure Labeling using DNN-BGRU-CRF Hybrid Neural Network

Blstm-Crf Based End-To-End Prosodic Boundary Prediction With Context Sensitive Embeddings In A Text-To-Speech Front-End

Multi-Task Learning for Prosodic Structure Generation Using BLSTM RNN with Structured Output Layer

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

A Character-level Span-based Model for Mandarin Prosodic Structure Prediction

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

Speech BERT Embedding For Improving Prosody in Neural TTS

Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Statistical Model Based on Probability Frequency for Mandarin Prosodic Structure Prediction

Ensemble prosody prediction for expressive speech synthesis

Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction with Multi-Modal Embeddings

Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech

Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model