Abstract: Multilingual and mixed-lingual speech synthesis is becoming increasingly important for communication across nations. To address the key problems in current research on this topic, this paper proposes a new multilingual speech synthesis platform, THMTTS. The first part presents the system architecture. THMTTS comprises three parts: the basic data structure definition part, which provides a general data structure and an information logging mechanism; the module definition part, which lets researchers design and implement new speech synthesis algorithms; and Crystal Sonic, the graphical user interface (GUI) and main entry point for speech synthesis, which displays the data flow and debug information, manages modules, handles file I/O, and controls the wave-out device. We designed a multi-level data structure that places no restrictions on its contents; the GUI calls a predefined enumeration method to iterate over all stored data and renders each item differently depending on its data type. Logs can be listed in the GUI or written to files or other streams. Another feature of the system is smart module composition. All modules implement the same interface and are built as dynamic-link libraries (DLLs). At system initialization, all modules stored in a designated location are loaded; users can then choose which of them to use and set the order in which they are linked. The second part discusses multilingual and mixed-lingual support. THMTTS aims to provide speech synthesis with language detection for four languages: Chinese, English, Japanese, and Korean. The modular structure itself is well suited to supporting multiple languages. The current system also integrates modules that perform encoding conversion and language detection.
Language detection is based on Unicode, a general-purpose encoding for international use. The paper also proposes a statistical method, based on the sum of probabilities, for distinguishing between languages; the experimental results show it to be effective. In conclusion, the platform provides a general and flexible system architecture for speech analysis and synthesis. Based on this architecture, a basic flowchart for mixed-lingual language detection and speech synthesis is introduced. The proposed architecture makes it possible to improve the quality of mixed-lingual speech synthesis.
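
As a rough illustration of Unicode-based language detection combined with a sum-of-probabilities decision, the sketch below may help. It is not the THMTTS implementation: the function names (`script_of`, `detect_language`), the code point ranges chosen, and especially the per-script probabilities in `SCRIPT_PROBS` are assumptions made for this example. The abstract does not specify how the paper estimates its probabilities.

```python
# Hypothetical sketch: Unicode-range script lookup plus a sum-of-probabilities vote.
# The probability values below are illustrative assumptions, not figures from the paper.

SCRIPT_PROBS = {
    "kana":   {"ja": 1.0},             # Hiragana/Katakana occur only in Japanese
    "hangul": {"ko": 1.0},             # Hangul syllables occur only in Korean
    "latin":  {"en": 1.0},             # ASCII letters are treated as English here
    "han":    {"zh": 0.7, "ja": 0.3},  # Han characters are shared (assumed split)
}

def script_of(ch):
    """Map a character to a coarse script class using its Unicode code point."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana blocks
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3:      # Hangul Syllables block
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs block
        return "han"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None                     # digits, punctuation, and so on carry no vote

def detect_language(text):
    """Sum each character's language probabilities and return the best language."""
    totals = {}
    for ch in text:
        script = script_of(ch)
        if script is None:
            continue
        for lang, prob in SCRIPT_PROBS[script].items():
            totals[lang] = totals.get(lang, 0.0) + prob
    if not totals:
        return None
    return max(totals, key=totals.get)

if __name__ == "__main__":
    for sample in ["你好世界", "今日は天気がいい", "안녕하세요", "hello world"]:
        print(sample, detect_language(sample))
```

Summing probabilities rather than counting characters lets ambiguous Han characters contribute partial evidence to both Chinese and Japanese, so a few kana in a sentence are enough to tip the decision toward Japanese. For mixed-lingual input, the same scoring could be applied per segment instead of per whole string.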

Design and Implementation of a Multilingual Speech Synthesis Platform
