Abstract: Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed. One limitation of NAR-TTS models is that they ignore the correlation in the time and frequency domains when generating speech mel-spectrograms, and thus produce blurry and over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of data distributions and the capability of modeling methods. Both simplifying data distributions and improving modeling methods can alleviate the problem. Accordingly, we first study methods that reduce the complexity of data distributions. Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods. Based on these studies, we find that: 1) Methods that provide additional condition inputs reduce the complexity of the data distributions to be modeled, thus alleviating the over-smoothing problem and achieving better voice quality. 2) Among advanced modeling methods, Laplacian mixture loss performs well at modeling multimodal distributions and is simple, whereas GAN and Glow achieve the best voice quality at the cost of increased training or model complexity. 3) The two categories of methods can be combined to further alleviate over-smoothness and improve voice quality. 4) Our experiments on the multi-speaker dataset lead to similar conclusions, and providing more variance information can reduce the difficulty of modeling the target data distribution and lower the requirements for model capacity.
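
To illustrate why a mixture loss can help with multimodal targets, the following is a minimal sketch of a Laplacian mixture negative log-likelihood for a single scalar target. It is not the paper's implementation: the abstract does not give the formulation, so the function names (`laplace_nll`, `laplace_mixture_nll`), the fixed parameters, and the toy bimodal setup are all assumptions made for illustration. In a real NAR-TTS model the weights, locations, and scales would presumably be predicted by the network for each mel-spectrogram bin.

```python
import math


def laplace_nll(y, loc, scale):
    """Negative log-likelihood of y under a single Laplacian(loc, scale)."""
    return math.log(2.0 * scale) + abs(y - loc) / scale


def laplace_mixture_nll(y, weights, locs, scales):
    """Negative log-likelihood of y under a K-component Laplacian mixture.

    weights: mixture weights (positive, summing to 1)
    locs:    component locations
    scales:  component scales (positive)
    """
    # Log of each weighted component density:
    # log w_k - log(2 b_k) - |y - mu_k| / b_k
    log_terms = [
        math.log(w) - math.log(2.0 * b) - abs(y - mu) / b
        for w, mu, b in zip(weights, locs, scales)
    ]
    # Log-sum-exp over components for numerical stability.
    m = max(log_terms)
    log_prob = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return -log_prob


# Toy bimodal target: values cluster at -1 and +1 (hypothetical data).
# A single Laplacian must sit between the modes with a wide scale,
# while a two-component mixture can place one sharp component on each mode.
single = laplace_nll(1.0, loc=0.0, scale=1.0)
mixture = laplace_mixture_nll(1.0, [0.5, 0.5], [-1.0, 1.0], [0.2, 0.2])
print(single > mixture)  # the mixture assigns higher likelihood to y = 1.0
```

In this toy case the single Laplacian centered between the two modes corresponds to the averaged, over-smoothed prediction, while the mixture keeps both modes sharp.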

Improved Katz smoothing for language modeling in speech recognition

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

Modeling and Simulation of English Speech Rationality Optimization Recognition Based on Improved Particle Filter Algorithm

Improving HMM Based Speech Synthesis by Reducing Over-Smoothing Problems

Comparison of Several Smoothing Methods in Statistical Language Model

Revisiting Over-Smoothness in Text to Speech

A Word Language Model Based Contextual Language Processing On Chinese Character Recognition

An Approach of Fundamental Frequencies Smoothing for Chinese Tone Recognition

Smoothing Algorithm of the Task Adaptation Chinese N-gram Model

An Innovative Prosody Modeling Method for Chinese Speech Recognition

Statistical Modification Based Post-Filtering Technique for HMM-based Speech Synthesis

A Smoothing Algorithm For The Task Adaptation Chinese Trigram Model

Modeling Pronunciation Variation Using Context-Dependent Weighting and B/s Refined Acoustic Modeling.

Language Model Adaptation Based on the Classification of a Trigram's Language Style Feature

Improvement of hidden Markov model (HMM) for speech recognition

Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text

Towards Robustness to Speech Rate in Mandarin All-Syllable Recognition

Improved speech recognition algorithm based on MFCC feature

Improved Posterior Probability Estimation Methods for the Freely-Spoken Speech Evaluation

Reliable Accent-Specific Unit Generation With Discriminative Dynamic Gaussian Mixture Selection for Multi-Accent Chinese Speech Recognition

Stochastic Language Models for Chinese Speech Recognition Based on Chinese Spelling