Abstract:Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At the high level, previous large-scale TTS models can be categorized into either Auto-regressive (AR) based (\textit{e.g.}, VALL-E) or Non-auto-regressive (NAR) based models (\textit{e.g.}, NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, thereby increasing the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present ({\romannumeral1}) a detailed analysis of the influence of speech tokenizer and noisy label for TTS performance; ({\romannumeral2}) four distinct types of sentence duration predictors; ({\romannumeral3}) a novel flow-based scalar latent transformer diffusion model. With these improvement, we show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available on: {https://dongchaoyang.top/SimpleSpeech2\_demo/}.

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech

ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Speech-T: Transducer for Text to Speech and Beyond

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models