Abstract: Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed. One limitation of NAR-TTS models is that they ignore the correlation in the time and frequency domains when generating speech mel-spectrograms, and thus produce blurry and over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of data distributions and the capability of modeling methods. Both simplifying data distributions and improving modeling methods can alleviate the problem. Accordingly, we first study methods that reduce the complexity of data distributions. Then we conduct a comprehensive study of NAR-TTS models that use advanced modeling methods. Based on these studies, we find that 1) methods that provide additional condition inputs reduce the complexity of the data distributions to be modeled, thus alleviating the over-smoothing problem and achieving better voice quality. 2) Among advanced modeling methods, Laplacian mixture loss performs well at modeling multimodal distributions and is simple, while GAN and Glow achieve the best voice quality at the cost of increased training or model complexity. 3) The two categories of methods can be combined to further alleviate over-smoothness and improve voice quality. 4) Our experiments on the multi-speaker dataset lead to similar conclusions, and providing more variance information can reduce the difficulty of modeling the target data distribution and lower the requirements on model capacity.

Comparison of Several Smoothing Methods in Statistical Language Model

Data Noising as Smoothing in Neural Network Language Models

Linear Interpolated Methods in Statistical Natural Language Processing

Improvement Comparison of Different Lattice-based Discriminative Training Methods in Chinese-monolingual and Chinese-English-bilingual Speech Recognition

Improving Language Model Size Reduction Using Better Pruning Criteria

Revisiting Over-Smoothness in Text to Speech

Toward a Unified Approach to Statistical Language Modeling for Chinese

ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding

Comparative Analysis of Language Models for Linguistic Examination of Ancient Chinese Classics: A Case Study of Zuozhuan Corpus.

A Word Language Model Based Contextual Language Processing On Chinese Character Recognition

Smoothing Algorithm of the Task Adaptation Chinese N-gram Model

Comparing Discrete and Continuous Space LLMs for Speech Recognition

An Efficient Approach of Language Model Applying in ASR Systems

Comparison of Modified Kneser-Ney and Witten-Bell Smoothing Techniques in Statistical Language Model of Bahasa Indonesia

A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition

Improved katz smoothing for language modeling in speech recogniton

Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

Weight-importance sparse training in keyword spotting

On smoothing techniques for bigram-based natural language modelling

Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction

A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models