Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

Xinhan Di, Zihao Chen, Yunming Liang, Junjie Zheng, Yihua Wang, Chaofan Ding

2024-08-01

Abstract: Large-scale text-to-speech (TTS) models have made significant progress recently. However, they still fall short in the generation of Chinese dialectal speech. To address this, we propose Bailing-TTS, a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. Bailing-TTS serves as a foundation model for Chinese dialectal speech generation. First, continual semi-supervised learning is proposed to facilitate the alignment of text tokens and speech tokens. Second, the Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes. With the proposed design of novel network architecture and corresponding strategy, Bailing-TTS is able to generate Chinese dialectal speech from text effectively and efficiently. Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards human-like spontaneous representation. Readers are encouraged to listen to demos at https://c9412600.github.io/bltts_tech_report/index.html.

Computation and Language, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses Chinese dialectal speech synthesis and proposes a family of large-scale text-to-speech (TTS) models named Bailing-TTS. Existing large-scale TTS models have made significant progress in non-dialectal speech generation, but they still fall short in generating high-quality Chinese dialectal speech. Bailing-TTS aims to convert text into high-quality, natural, and fluent Chinese dialectal speech.

The main contributions of Bailing-TTS include:

1. **Continual Semi-Supervised Learning Framework**: A continual semi-supervised learning strategy is proposed to facilitate the alignment of text tokens and speech tokens, which helps in handling multimodal data.
2. **Chinese Dialect Representation Learning**: Representation learning for Chinese dialects is developed through a specific Transformer architecture and a multi-stage training process to improve the quality of generated speech.
3. **Hierarchical Reinforcement Post-Training Extension Techniques**: A series of hierarchical reinforcement learning strategies is designed to further enhance the quality of Chinese dialectal speech generation.

Experimental results show that Bailing-TTS generates natural and fluent Chinese dialectal speech close to human level, performing well in both objective and subjective evaluations. It also performs well in zero-shot and fine-tuning settings. Additionally, the study discusses the practical application potential and limitations of Bailing-TTS and outlines future work directions, including support for multiple input modalities and the ability to generate audio content such as music.

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation

Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

Knowledge-based Linguistic Encoding for End-to-End Mandarin Text-to-Speech Synthesis

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario

SR-TTS: a rhyme-based end-to-end speech synthesis system

A Preliminary Study on Deep Learning-based Chinese Text to Taiwanese Speech Synthesis System

DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer

The NTU-AISG Text-to-speech System for Blizzard Challenge 2020