Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

Xinhan Di, Zihao Chen, Yunming Liang, Junjie Zheng, Yihua Wang, Chaofan Ding
2024-08-01
Abstract: Large-scale text-to-speech (TTS) models have made significant progress recently. However, they still fall short in the generation of Chinese dialectal speech. To address this, we propose Bailing-TTS, a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. Bailing-TTS serves as a foundation model for Chinese dialectal speech generation. First, continual semi-supervised learning is proposed to facilitate the alignment of text tokens and speech tokens. Second, Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes. With the proposed novel network architecture and corresponding training strategy, Bailing-TTS is able to generate Chinese dialectal speech from text effectively and efficiently. Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards human-like spontaneous representation. Readers are encouraged to listen to demos at https://c9412600.github.io/bltts_tech_report/index.html.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses Chinese dialectal speech synthesis and proposes Bailing-TTS, a family of large-scale text-to-speech (TTS) models. Existing large-scale TTS models have made significant progress in non-dialectal speech generation, but they still fall short in generating high-quality Chinese dialectal speech. Bailing-TTS aims to convert text into high-quality, natural, and fluent Chinese dialectal speech.

The main contributions of Bailing-TTS are:

1. **Continual Semi-Supervised Learning Framework**: A continual semi-supervised learning strategy is proposed to facilitate the alignment of text tokens and speech tokens, which helps in handling multimodal data.
2. **Chinese Dialect Representation Learning**: The representation of Chinese dialects is learned through a specific Transformer architecture and a multi-stage training process, with the goal of improving the quality of the generated speech.
3. **Hierarchical Reinforcement Post-Training Extension Techniques**: A series of hierarchical reinforcement learning strategies is designed to further enhance the quality of Chinese dialectal speech generation.

Experimental results show that Bailing-TTS generates natural and fluent Chinese dialectal speech that approaches human level, and that it performs well in both objective and subjective evaluations. It also performs well in zero-shot and fine-tuning settings. Finally, the study discusses the practical application potential and limitations of Bailing-TTS and outlines directions for future work, including support for multimodal inputs and the generation of other audio content such as music.