Abstract: Multilingual and mixed-lingual speech synthesis has become increasingly important for communication across nations. To address the key problems and the current state of research in this area, this paper proposes THMTTS, a new multilingual speech synthesis platform. The first part presents the system architecture. THMTTS comprises three parts: the basic data structure definition part, which provides a general data structure and an information logging mechanism; the module definition part, which lets researchers design and implement new speech synthesis algorithms; and Crystal Sonic, the graphical user interface (GUI) and main entry point for speech synthesis, which displays the data flow and debug information, manages modules, handles file I/O, and controls the wave-out device. We designed a multi-level data structure that does not restrict its contents; the GUI calls a pre-defined enumeration method to iterate over all stored data and presents each item differently depending on its data type. Logs can be listed in the GUI or written to files and other streams. Another feature of the system is smart module composition. Every module implements the same interface and is built as a dynamic-link library (DLL). At system initialization, all modules stored in a designated location are loaded; users can then choose manually which of them to use and set the order in which they are linked. The second part discusses multilingual and mixed-lingual support. THMTTS aims to provide speech synthesis with language detection for four languages: Chinese, English, Japanese, and Korean. The modular structure is itself an advantage for supporting multiple languages. The current system also integrates modules that carry out encoding conversion and language detection.
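The module composition described above can be sketched as follows. This is a minimal illustration only, assuming a Python registry as a stand-in for DLL loading; the names `Module`, `REGISTRY`, `register`, `run_pipeline`, and the two example modules are hypothetical and are not taken from THMTTS.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the shared module interface: every module
# receives the shared data dictionary and returns it, possibly extended.
Module = Callable[[dict], dict]

# Filled at start-up, in place of scanning a directory for DLLs.
REGISTRY: Dict[str, Module] = {}


def register(name: str):
    """Record a module under a name, as a loader might do at initialization."""
    def wrap(fn: Module) -> Module:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("encoding_conversion")
def encoding_conversion(data: dict) -> dict:
    # Assumed behaviour: decode the raw input bytes into Unicode text.
    data["text"] = data["raw"].decode(data.get("encoding", "utf-8"))
    return data


@register("tokenizer")
def tokenizer(data: dict) -> dict:
    # Assumed behaviour: split the decoded text on whitespace.
    data["tokens"] = data["text"].split()
    return data


def run_pipeline(order: List[str], data: dict) -> dict:
    """Run the user-selected modules in the user-selected order."""
    for name in order:
        data = REGISTRY[name](data)
        # Keep a simple log entry per module, mirroring the logging mechanism.
        data.setdefault("log", []).append(f"ran {name}")
    return data


result = run_pipeline(
    ["encoding_conversion", "tokenizer"],
    {"raw": "hello world".encode("utf-8")},
)
```

Because every module shares one call signature, changing the list passed to `run_pipeline` is enough to reorder or swap modules, which is the property the abstract attributes to the common interface.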
Language detection is based on Unicode, a general-purpose encoding for international use. The paper also proposes a statistical method, based on a sum of probabilities, for distinguishing between languages; the experimental results show it to be effective. In conclusion, the platform provides a general and flexible system architecture for speech analysis and synthesis. Building on this architecture, a basic flowchart for mixed-lingual language detection and speech synthesis is introduced. The proposed architecture makes it possible to improve the quality of mixed-lingual speech synthesis.
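One way to read "Unicode-based detection with a sum of probabilities" is sketched below. The abstract does not give the actual formulation, so this is an assumption: each character contributes a per-language probability derived from its Unicode block, and the language with the largest sum wins. The function names and the weights for shared CJK ideographs are invented for illustration.

```python
from collections import defaultdict

LANGS = ("zh", "en", "ja", "ko")


def char_probs(ch: str) -> dict:
    """Return illustrative P(language | character) based on the code point."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:            # Hiragana and Katakana blocks
        return {"ja": 1.0}
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:  # Hangul
        return {"ko": 1.0}
    if 0x4E00 <= cp <= 0x9FFF:            # CJK Unified Ideographs (shared)
        # Made-up weights: ideographs occur in Chinese, Japanese and Korean.
        return {"zh": 0.6, "ja": 0.3, "ko": 0.1}
    if ch.isascii() and ch.isalpha():     # basic Latin letters
        return {"en": 1.0}
    return {}                             # digits, punctuation, whitespace


def detect(text: str) -> str:
    """Sum the per-character probabilities and pick the best language."""
    scores = defaultdict(float)
    for ch in text:
        for lang, p in char_probs(ch).items():
            scores[lang] += p
    if not scores:
        return "unknown"
    return max(LANGS, key=lambda lang: scores[lang])
```

In this sketch a Japanese sentence containing kanji is still labelled Japanese, because its kana characters add full weight to `ja` while each ideograph adds only a partial weight to `zh`. Mixed-lingual input would additionally need segmentation into runs before detection, which is not shown here.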

An HMM-based Cantonese speech synthesis system

A Unified Framework for Multilingual Text-to-speech Synthesis with SSML Specification As Interface

End-to-end Code-switched TTS with Mix of Monolingual Recordings.

USTC System for Blizzard Challenge 2006 an Improved HMM-based Speech Synthesis Method

Mandarin-English Mixed TTS Based on HCSIPA

Design and Implementation of a Multilingual Speech Synthesis Platform

A Novel HTS System Using both Continuous HMMs and Discrete HMMs

An Unified and Automatic Approach of Mandarin HTS System.

A Novel HMM-Based TTS System Using Both Continuous HMMs and Discrete HMMs

Syllable HMM Based Mandarin TTS and Comparison with Concatenative TTS.

Mandarin-English Mixed Text to Speech Based on HCSIPA

The USTC and iFlytek Speech Synthesis Systems for Blizzard Challenge 2007

A Preliminary Study on Deep Learning-based Chinese Text to Taiwanese Speech Synthesis System

Cantonese neural speech synthesis from found newscasting video data and its speaker adaptation

A Miniature Chinese TTS System Based on Tailored Corpus

The USTC System for Blizzard Challenge 2008

The Huya multi-speaker and multi-style speech synthesis system for M2VoC Challenge 2020

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

The USTC System for Blizzard Challenge 2009

Label Transform Based Cross-Language Speaker Adaptation in Bilingual (Mandarin-English) TTS