Cantonese neural speech synthesis from found newscasting video data and its speaker adaptation

Raymond Chung
DOI: https://doi.org/10.1109/ISCSLP57327.2022.10037851
2022-12-11
Abstract: This paper investigates speech synthesis for a Chinese dialect: Cantonese. Because no sizable publicly accessible speech corpus exists, relatively little Text-to-Speech (TTS) research has addressed this spoken language. We present a data mining pipeline that collects a large quantity of high-quality Cantonese audio data from a newscasting video program. The found data enables the training of a deep learning-based multi-speaker speech synthesis model using only Cantonese audio. We varied the amount of data used for speaker adaptation from a pre-trained model. Our results suggest that fine-tuning the model on 2 hours of audio yields synthetic audio quality, in terms of voice similarity and mel cepstral distortion (MCD) with respect to ground-truth audio, similar to that obtained with more hours of audio. Furthermore, we conducted a subjective preference test comparing speech samples synthesized by the adapted model against speech samples generated through Microsoft’s cloud-based TTS service. Reviewers preferred our samples in terms of naturalness for texts drawn from daily conversation and from novels.
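For readers unfamiliar with the MCD metric named in the abstract, the following is a minimal sketch of the commonly used definition, not necessarily the exact variant the paper computes. It assumes the ground-truth and synthesized mel-cepstral frames are already time-aligned (for example by dynamic time warping) and that the 0th (energy) coefficient is excluded; the function name `mel_cepstral_distortion` and the input layout are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned mel-cepstral frames.

    Assumptions (not taken from the paper): frames are already aligned,
    each frame is a sequence of coefficients, and index 0 (energy) is skipped.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame sequences")

    scale = 10.0 / math.log(10.0)  # converts the natural-log scale to dB
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance over coefficients 1..D
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


# Toy example with two 4-coefficient frames (values are illustrative only)
reference = [[1.0, 0.5, -0.2, 0.1], [0.9, 0.4, -0.1, 0.0]]
synthetic = [[1.2, 0.5, -0.2, 0.1], [0.9, 0.4, -0.1, 0.0]]
print(mel_cepstral_distortion(reference, synthetic))
```

Under this definition a lower MCD means the synthesized spectrum is closer to the ground truth. In the toy example only the skipped 0th coefficient differs, so the distortion is zero.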
Computer Science, Linguistics