Multi-speaker Chinese news broadcasting system based on improved Tacotron2
Wei Zhao, Yue Lian, Jianping Chai, Zhongwen Tu
DOI: https://doi.org/10.1007/s11042-023-15279-z
IF: 2.577
2023-05-04
Multimedia Tools and Applications
Abstract: In recent years, the demand for news broadcasting has grown with the explosion of information. An automatic news broadcasting system based on deep-learning text-to-speech technology can overcome the limited working hours and the errors associated with manual broadcasting. Most existing speech synthesis technologies, however, cannot switch speakers in real time and do not address a number of problems specific to Chinese news broadcasting scenarios. In this paper, we propose a multi-speaker Chinese news broadcasting system with switchable timbres, trained on CNews, a Chinese news corpus that we established. The system uses the CPM module to convert Chinese text into pinyin phonemes more accurately, and a timbre encoder to construct multi-speaker timbre feature embeddings. To handle the long texts common in news, the acoustic model is an improved Tacotron2 that uses Discrete Grave attention as its attention mechanism, so that the model needs less audio data in the training phase and extracts contextual information better. The HiFi-GAN vocoder replaces the original WaveNet for generating time-domain waveforms, reducing synthesis time and improving the voice quality of the synthesized speech. Experiments show that, compared with Tacotron2, the system can flexibly change the target timbre according to the reference speech. Moreover, when trained on limited data, it can synthesize speech with the prosody and style of the target presenter, with better naturalness and faster inference, making it usable for real-time news broadcasting.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
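
The abstract describes a four-stage data flow: text to pinyin phonemes, reference speech to a timbre embedding, phonemes plus embedding to acoustic features, and acoustic features to a waveform. The sketch below only illustrates how such stages could be composed so that the speaker is switched by changing the reference audio. It is not the authors' implementation: every name (`BroadcastPipeline`, `toy_front_end`, `TOY_LEXICON`, and so on) is hypothetical, and each stage is a trivial stand-in rather than the CPM module, the timbre encoder, the improved Tacotron2, or HiFi-GAN.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical type aliases for this sketch only.
Phonemes = List[str]
Vector = List[float]
Mel = List[Vector]        # one feature vector per frame
Waveform = List[float]


@dataclass
class BroadcastPipeline:
    """Composes the four stages named in the abstract (all stages are placeholders)."""
    front_end: Callable[[str], Phonemes]                 # role of the CPM module
    timbre_encoder: Callable[[Sequence[float]], Vector]  # role of the timbre encoder
    acoustic_model: Callable[[Phonemes, Vector], Mel]    # role of the improved Tacotron2
    vocoder: Callable[[Mel], Waveform]                   # role of HiFi-GAN

    def synthesize(self, text: str, reference_audio: Sequence[float]) -> Waveform:
        phonemes = self.front_end(text)
        timbre = self.timbre_encoder(reference_audio)
        mel = self.acoustic_model(phonemes, timbre)
        return self.vocoder(mel)


# Toy stand-ins, assumed purely for illustration.
TOY_LEXICON = {"新": "xin1", "闻": "wen2"}


def toy_front_end(text: str) -> Phonemes:
    # Look up each character; a real front end must also resolve polyphones.
    return [TOY_LEXICON[ch] for ch in text if ch in TOY_LEXICON]


def toy_timbre_encoder(audio: Sequence[float]) -> Vector:
    # Summarize the reference audio as a fixed-length vector (mean, peak).
    mean = sum(audio) / len(audio)
    peak = max(abs(x) for x in audio)
    return [mean, peak]


def toy_acoustic_model(phonemes: Phonemes, timbre: Vector) -> Mel:
    # One frame per phoneme, with the timbre vector appended to each frame.
    # Conditioning by concatenation is an assumption, not taken from the paper.
    return [[float(len(p))] + list(timbre) for p in phonemes]


def toy_vocoder(mel: Mel) -> Waveform:
    # Collapse each frame to a single sample.
    return [sum(frame) for frame in mel]


pipeline = BroadcastPipeline(
    front_end=toy_front_end,
    timbre_encoder=toy_timbre_encoder,
    acoustic_model=toy_acoustic_model,
    vocoder=toy_vocoder,
)

# Same text, two different reference recordings: the output changes with the reference.
wave_a = pipeline.synthesize("新闻", [0.1, -0.2, 0.3])
wave_b = pipeline.synthesize("新闻", [0.5, 0.5, -0.9])
```

Keeping the timbre embedding as an explicit input to the acoustic stage is what lets the speaker change per call without retraining or swapping models, which is the behavior the abstract attributes to the system.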