Learning Language and Speaker Information for Code-Switch Speech Synthesis with Limited Data.
Mengxin Chai, Shaotong Guo, Cheng Gong, Longbiao Wang, Jianwu Dang, Ju Zhang
DOI: https://doi.org/10.1109/asru51503.2021.9687961
2021-01-01
Abstract: End-to-end speech synthesis demonstrates remarkable performance on monolingual speech, whereas code-switching (CS) speech synthesis remains a challenge owing to the sparsity of data and the diverse syntactic structures across languages. Previous studies show that large mixed-lingual corpora are essential for effectively learning text/language representations and target speaker information. In this study, we propose a method using three independent encoders (text, language, and speaker), which requires only a small amount of mixed-lingual data to realize CS speech synthesis of Mandarin and English. Additionally, to distinguish between Mandarin and English, we investigate two text-representation methods: (1) the implicit method, which uses Pinyin and the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) to represent the two languages; and (2) the explicit method, which uses language markers, i.e., masks, to differentiate the languages. With the proposed method, we improve the quality and speaker similarity of synthesized speech using a small amount of mixed-lingual data. The experimental results demonstrate that the proposed method achieves an improvement of 0.06 in mean opinion score and an absolute improvement of 0.64% in character error rate compared to the baseline method.