Abstract: Scaling text-to-speech (TTS) to large-scale datasets has been demonstrated to be an effective method for improving the diversity and naturalness of synthesized speech. At a high level, previous large-scale TTS models can be categorized as either autoregressive (AR) models (e.g., VALL-E) or non-autoregressive (NAR) models (e.g., NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models suffer from unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, which increases the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient NAR TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both AR and NAR methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present (i) a detailed analysis of the influence of the speech tokenizer and noisy labels on TTS performance; (ii) four distinct types of sentence duration predictors; and (iii) a novel flow-based scalar latent transformer diffusion model. With these improvements, we show significant gains in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available at: https://dongchaoyang.top/SimpleSpeech2_demo/

CTC-based Non-autoregressive Textless Speech-to-Speech Translation

Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

LV-CTC: Non-autoregressive ASR with CTC and latent variable models

Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Consecutive Decoding for Speech-to-text Translation

CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation

Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Textless Speech-to-Speech Translation With Limited Parallel Data

Back Translation for Speech-to-text Translation Without Transcripts

Hybrid Autoregressive and Non-Autoregressive Transformer Models for Speech Recognition

Improving Non-autoregressive Translation Quality with Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC

Pre-training for Speech Translation: CTC Meets Optimal Transport

Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

Towards End-to-end Speech-to-text Translation with Two-pass Decoding

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation