Abstract: Scaling text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At a high level, previous large-scale TTS models can be categorized as either autoregressive (AR) models (e.g., VALL-E) or non-autoregressive (NAR) models (e.g., NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models suffer from unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, which increases the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient NAR TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both AR and NAR methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present (i) a detailed analysis of the influence of the speech tokenizer and noisy labels on TTS performance; (ii) four distinct types of sentence duration predictors; and (iii) a novel flow-based scalar latent transformer diffusion model. With these improvements, we show significant gains in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available at: https://dongchaoyang.top/SimpleSpeech2_demo/.

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

A Study of Discriminatory Speech Classification Based on Improved Smote and SVM-RF

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch

A Large-Scale Evaluation of Speech Foundation Models

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

On the Impact of Noises in Crowd-Sourced Data for Speech Translation

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

A Study on Incorporating Whisper for Robust Speech Assessment

Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding

Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Don't Speak Too Fast: the Impact of Data Bias on Self-Supervised Speech Models