Abstract: Audio token modeling has become a powerful framework for speech synthesis, with two-stage approaches employing semantic tokens remaining prevalent. In this paper, we aim to simplify this process by introducing a semantic knowledge distillation method that enables high-quality speech generation in a single stage. Our proposed model improves speech quality, intelligibility, and speaker similarity compared to a single-stage baseline. Although two-stage systems still lead in intelligibility, our model significantly narrows the gap while delivering comparable speech quality. These findings showcase the potential of single-stage models to achieve efficient, high-quality TTS with a more compact and streamlined architecture.

What problem does this paper attempt to address?

The main problem this paper attempts to address is simplifying the complexity of current two-stage Text-to-Speech (TTS) systems while maintaining or improving the quality, intelligibility, and speaker similarity of the synthesized speech. Specifically, the authors propose a new single-stage Non-Autoregressive (NAR) TTS model, NARS IS (NAR Single Stage TTS), which incorporates Semantic Knowledge Distillation (SKD) to integrate semantic information during training without adding extra processing steps during inference.

### Main Issues:

1. **Simplifying the Process**: Existing two-stage TTS systems perform well in terms of speech quality and intelligibility but are complex and computationally expensive because they require separate generation of semantic and audio tokens. NARS IS aims to simplify this process by merging these two stages into one.
2. **Improving Performance**: While single-stage TTS models are more efficient, they typically lag behind two-stage systems in terms of speech quality and intelligibility. NARS IS significantly improves the performance of single-stage models in these areas by introducing SKD, narrowing the gap with two-stage systems.
3. **Maintaining a Compact Architecture**: Traditional single-stage models are often structurally complex. NARS IS achieves a more compact and streamlined architecture through efficient design, enhancing efficiency while maintaining high quality.

### Solution:

- **Semantic Knowledge Distillation (SKD)**: By extracting semantic information from pre-trained self-supervised speech encoders (such as HuBERT) and injecting it into the TTS model during training, NARS IS can better understand and generate high-quality speech.
- **Non-Autoregressive Model (NAR)**: Using a non-autoregressive approach, NARS IS can generate audio tokens in parallel, significantly speeding up inference.
- **Multi-Task Loss Function**: The model employs a multi-task loss function that combines audio token and semantic feature losses, ensuring high-quality audio generation while capturing semantic information.

### Experimental Results:

- **Objective Evaluation**: Experimental results show that NARS IS outperforms single-stage baseline models on multiple metrics (such as Word Error Rate (WER), Speaker Similarity Score (SSS), Mel-Cepstral Distortion (MCD), and UTMOS score) and even approaches or exceeds two-stage systems on some metrics.
- **Subjective Evaluation**: Human listening tests indicate that NARS IS receives high ratings in naturalness, prosody, pitch, quality, and intelligibility, further validating its effectiveness in practical applications.

In summary, this paper successfully addresses key performance and efficiency issues in single-stage TTS systems by introducing SKD and an optimized NAR model design, providing new directions for future speech synthesis research.
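The multi-task loss described under Solution can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The summary only says that an audio token loss and a semantic feature loss are combined, so several things here are assumptions made for illustration: the token loss is a cross-entropy taken over masked frames only, the distillation term is a cosine distance to teacher features (for example from HuBERT), and the two are mixed by a hypothetical weight `skd_weight`. The function and parameter names are likewise invented.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def cosine_distance(u, v):
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def multitask_loss(token_logits, token_targets, masked,
                   pred_feats, teacher_feats, skd_weight=1.0):
    """Return (total, token_loss, skd_loss) for one utterance.

    token_logits  : per-frame logits over the audio-token vocabulary
    token_targets : per-frame ground-truth audio-token ids
    masked        : per-frame booleans, True where the input token was masked
    pred_feats    : per-frame semantic features predicted by the TTS model
    teacher_feats : per-frame features from a frozen teacher encoder
    skd_weight    : hypothetical mixing weight (assumption, not from the paper)
    """
    # Audio-token term: averaged over masked frames only (assumption).
    idx = [i for i, is_masked in enumerate(masked) if is_masked]
    token_loss = sum(cross_entropy(token_logits[i], token_targets[i])
                     for i in idx) / max(len(idx), 1)
    # Distillation term: averaged over all frames (assumption).
    skd_loss = sum(cosine_distance(p, t)
                   for p, t in zip(pred_feats, teacher_feats)) / len(pred_feats)
    return token_loss + skd_weight * skd_loss, token_loss, skd_loss


# Toy usage: two frames, a four-token vocabulary, two-dimensional features.
total, tok, skd = multitask_loss(
    token_logits=[[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]],
    token_targets=[2, 0],
    masked=[True, False],
    pred_feats=[[1.0, 0.0], [0.0, 2.0]],
    teacher_feats=[[1.0, 0.0], [0.0, 2.0]],
)
print(total, tok, skd)
```

Because the teacher features enter only through the loss, the teacher encoder is needed during training but not at inference, which matches the summary's point that SKD adds no extra processing steps at inference time.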

Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation
