Abstract: Scaling text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At a high level, previous large-scale TTS models can be categorized as either autoregressive (AR) models (e.g., VALL-E) or non-autoregressive (NAR) models (e.g., NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models suffer from unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, which increases the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient NAR TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both AR and NAR methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present (i) a detailed analysis of the influence of the speech tokenizer and noisy labels on TTS performance; (ii) four distinct types of sentence duration predictors; and (iii) a novel flow-based scalar latent transformer diffusion model. With these improvements, we show significant gains in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available at: https://dongchaoyang.top/SimpleSpeech2_demo/.

Lightweight Convolution-Based Chinese Speech Synthesis Method

Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

Light-tts: lightweight multi-speaker multi-lingual text-to-speech

DOP-Tacotron: a Fast Chinese TTS System with Local-based Attention

CRCTTS: Convolution-Recurrent-Convolution Text-to-Speech System

Chinese Speech Synthesis System Based on End to End

A Preliminary Study on Deep Learning-based Chinese Text to Taiwanese Speech Synthesis System

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech

Lightspeech: Lightweight Non-Autoregressive Multi-Speaker Text-To-Speech

VF-Taco2: Towards Fast and Lightweight Synthesis for Autoregressive Models with Variation Autoencoder and Feature Distillation

Improving Model Stability and Training Efficiency in Fast, High Quality Expressive Voice Conversion System

A Novel Method for Mandarin Speech Synthesis by Inserting Prosodic Structure Prediction into Tacotron2

Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Neural Speech Synthesis with Transformer Network

FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

A Transformer-based Chinese Non-autoregressive Speech Synthesis Scheme

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding