Abstract: This paper describes DelightfulTTS, the Microsoft end-to-end neural text-to-speech (TTS) system for Blizzard Challenge 2021. The goal of the challenge is to synthesize natural, high-quality speech from text, and we approach it from two perspectives. The first is to directly model and generate waveforms at a 48 kHz sampling rate, which yields higher perceptual quality than previous systems with 16 kHz or 24 kHz sampling rates. The second is to model the variation information in speech through a systematic design, which improves prosody and naturalness. Specifically, for 48 kHz modeling, we predict a 16 kHz mel-spectrogram in the acoustic model and propose a vocoder called HiFiNet to directly generate the 48 kHz waveform from the predicted 16 kHz mel-spectrogram, which offers a better trade-off among training efficiency, modeling stability, and voice quality. We model variation information systematically from both explicit (speaker ID, language ID, pitch, and duration) and implicit (utterance-level and phoneme-level prosody) perspectives: 1) for speaker and language ID, we use lookup embeddings in training and inference; 2) for pitch and duration, we extract the values from paired text-speech data in training and use two predictors to predict the values in inference; 3) for utterance-level and phoneme-level prosody, we use two reference encoders to extract the values in training and two separate predictors to predict the values in inference. Additionally, we introduce an improved Conformer block to better model local and global dependencies in the acoustic model. For task SH1, DelightfulTTS achieves a mean score of 4.17 in the MOS test and 4.35 in the SMOS test, which indicates the effectiveness of the proposed system.
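
As a rough illustration of the 16 kHz to 48 kHz relationship described in the abstract, the sketch below computes how many waveform samples a vocoder would need to emit per mel frame. The hop length of 200 samples (12.5 ms at 16 kHz) and the function names are assumptions chosen for illustration; the abstract does not state them, and this is not the HiFiNet implementation.

```python
def samples_per_frame(hop_length: int, mel_rate_hz: int, wave_rate_hz: int) -> int:
    """Waveform samples the vocoder must produce for each mel frame.

    hop_length is measured in samples at mel_rate_hz.
    """
    if wave_rate_hz % mel_rate_hz != 0:
        raise ValueError("output rate must be an integer multiple of the mel rate")
    return hop_length * (wave_rate_hz // mel_rate_hz)


def waveform_length(num_frames: int, hop_length: int,
                    mel_rate_hz: int = 16_000, wave_rate_hz: int = 48_000) -> int:
    """Total output samples for a mel-spectrogram with num_frames frames."""
    return num_frames * samples_per_frame(hop_length, mel_rate_hz, wave_rate_hz)


# Assumed hop length: 200 samples at 16 kHz (12.5 ms); not given in the abstract.
HOP_LENGTH = 200

if __name__ == "__main__":
    # 80 frames at a 12.5 ms hop cover one second of audio.
    print(samples_per_frame(HOP_LENGTH, 16_000, 48_000))
    print(waveform_length(80, HOP_LENGTH))
```

Under these assumptions, each mel frame corresponds to three times as many output samples as it would at 16 kHz, which is the upsampling the vocoder has to absorb when the acoustic model stays at the lower rate.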


The USTC System for Blizzard Challenge 2009

BLSTM Guided Unit Selection Synthesis System for Blizzard Challenge 2016

USTC System for Blizzard Challenge 2006: an Improved HMM-based Speech Synthesis Method

The USTC System for Blizzard Challenge 2008

The USTC and iFlytek Speech Synthesis Systems for Blizzard Challenge 2007

Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2009

The FruitShell French synthesis system at the Blizzard 2023 Challenge

The Sogou Speech Synthesis System for Blizzard Challenge 2018

The USTC System for Blizzard Challenge 2010

Robustness of HMM-based Speech Synthesis

The Iflytek System for Blizzard Machine Learning Challenge 2017-ES1

The DeepZen Speech Synthesis System for Blizzard Challenge 2023


The NLPR Speech Synthesis Entry for Blizzard Challenge 2020

MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023

DelightfulTTS: the Microsoft Speech Synthesis System for Blizzard Challenge 2021

The NTU-AISG Text-to-speech System for Blizzard Challenge 2020

The WISTON Text to Speech System for Blizzard 2008

Design of Speech Corpus for Mandarin Text to Speech