Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

Gerard I. Gállego, Roy Fejgin, Chunghsin Yeh, Xiaoyu Liu, Gautam Bhattacharya
2024-09-17
Abstract: Audio token modeling has become a powerful framework for speech synthesis, with two-stage approaches employing semantic tokens remaining prevalent. In this paper, we aim to simplify this process by introducing a semantic knowledge distillation method that enables high-quality speech generation in a single stage. Our proposed model improves speech quality, intelligibility, and speaker similarity compared to a single-stage baseline. Although two-stage systems still lead in intelligibility, our model significantly narrows the gap while delivering comparable speech quality. These findings showcase the potential of single-stage models to achieve efficient, high-quality TTS with a more compact and streamlined architecture.
Sound, Artificial Intelligence, Audio and Speech Processing, Signal Processing
What problem does this paper attempt to address?
The main problem this paper addresses is reducing the complexity of current two-stage Text-to-Speech (TTS) systems while maintaining or improving the quality, intelligibility, and speaker similarity of the synthesized speech. Specifically, the authors propose a single-stage Non-Autoregressive (NAR) TTS model, NARSiS (NAR Single Stage TTS), which uses Semantic Knowledge Distillation (SKD) to integrate semantic information during training without adding processing steps at inference.

### Main Issues:

1. **Simplifying the Process**: Existing two-stage TTS systems perform well in speech quality and intelligibility, but they are complex and computationally expensive because they generate semantic tokens and audio tokens separately. NARSiS merges these two stages into one.
2. **Improving Performance**: Single-stage TTS models are more efficient but typically lag behind two-stage systems in speech quality and intelligibility. By introducing SKD, NARSiS improves single-stage performance in these areas and narrows the gap with two-stage systems.
3. **Maintaining a Compact Architecture**: Traditional single-stage models are often structurally complex. NARSiS aims for a more compact and streamlined architecture while maintaining high quality.

### Solution:

- **Semantic Knowledge Distillation (SKD)**: Semantic information is extracted from pre-trained self-supervised speech encoders (such as HuBERT) and injected into the TTS model during training, so that the model captures semantic structure and generates higher-quality speech.
- **Non-Autoregressive Model (NAR)**: With a non-autoregressive approach, NARSiS generates audio tokens in parallel, which significantly speeds up inference.
- **Multi-Task Loss Function**: The model is trained with a multi-task loss that combines an audio token loss and a semantic feature loss, so that it generates high-quality audio while also capturing semantic information.

### Experimental Results:

- **Objective Evaluation**: NARSiS outperforms single-stage baseline models on multiple metrics (such as Word Error Rate (WER), Speaker Similarity Score (SSS), Mel-Cepstral Distortion (MCD), and UTMOS score) and approaches or exceeds two-stage systems on some metrics.
- **Subjective Evaluation**: In human listening tests, NARSiS receives high ratings for naturalness, prosody, pitch, quality, and intelligibility, further supporting its effectiveness in practical applications.

In summary, this paper addresses key performance and efficiency issues in single-stage TTS systems by introducing SKD and an optimized NAR model design, and it points to new directions for future speech synthesis research.
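The multi-task loss described under Solution can be illustrated with a minimal sketch. The summary only states that an audio token loss and a semantic feature loss are combined. Everything else here is an assumption made for illustration and not taken from the paper: the use of cross-entropy on masked frames only, cosine distance as the semantic term, the `weight` parameter, and all function names.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def multitask_loss(token_logits, token_targets, masked,
                   student_feats, teacher_feats, weight=1.0):
    """Hypothetical combined loss: audio token term + weighted semantic term.

    token_logits:  per-frame logits over the audio token vocabulary
    token_targets: per-frame ground-truth audio token ids
    masked:        per-frame booleans, True where the input token was masked
    student_feats: per-frame features predicted by the TTS model
    teacher_feats: per-frame features from a frozen speech encoder
    weight:        assumed scalar balancing the two terms
    """
    masked_idx = [i for i, is_masked in enumerate(masked) if is_masked]
    if not masked_idx:
        raise ValueError("at least one frame must be masked")
    # Audio token loss: mean cross-entropy over the masked frames only.
    audio = sum(cross_entropy(token_logits[i], token_targets[i])
                for i in masked_idx) / len(masked_idx)
    # Semantic loss: mean cosine distance to the teacher over all frames.
    semantic = sum(cosine_distance(s, t)
                   for s, t in zip(student_feats, teacher_feats)) / len(student_feats)
    return audio + weight * semantic, audio, semantic

# Toy example with two frames, a two-token vocabulary, and 2-d features.
total, audio, semantic = multitask_loss(
    token_logits=[[0.0, 0.0], [5.0, 0.0]],
    token_targets=[0, 1],
    masked=[True, False],
    student_feats=[[1.0, 0.0], [0.0, 1.0]],
    teacher_feats=[[0.0, 1.0], [0.0, 3.0]],
    weight=2.0,
)
print(total, audio, semantic)
```

Because the semantic term is used only during training in this sketch, dropping it at inference leaves a single token-prediction pass, which is consistent with the summary's claim that SKD adds no extra inference steps.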