Code-switched speech synthesis using bilingual phonetic posteriorgram with only monolingual corpora

Yuewen Cao, Songxiang Liu, Xixin Wu, Shiyin Kang, Peng Liu, Zhiyong Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng
DOI: https://doi.org/10.1109/icassp40776.2020.9053094
2020-01-01
Abstract: Synthesizing fluent code-switched (CS) speech with a consistent voice using only monolingual corpora is still a challenging task, since language alternation seldom occurs during training and speaker identity is directly correlated with language. In this paper, we present a bilingual phonetic posteriorgram (PPG) based CS speech synthesizer that uses only monolingual corpora. The bilingual PPG, which bridges across speakers and languages, is formed by stacking two monolingual PPGs extracted from two monolingual speaker-independent speech recognition systems. The bilingual PPG is assumed to represent the articulation of speech sounds speaker-independently and to capture accurate phonetic information of both languages in the same feature space. The proposed model first extracts bilingual PPGs from the training data. An encoder-decoder based model then learns the relationship between input text and bilingual PPGs, and the bilingual PPGs are mapped to acoustic features by a bidirectional long short-term memory (BLSTM) based model conditioned on a speaker embedding to control speaker identity. Experiments validate the effectiveness of the proposed model in terms of speech intelligibility, audio fidelity and speaker consistency of the generated code-switched speech.
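
The stacking step in the abstract can be illustrated with a minimal sketch. It assumes that "stacking" means frame-wise concatenation of the two monolingual PPGs along the feature axis, which the abstract does not spell out. The function names `stack_ppgs` and `random_posteriors`, the class counts (4 and 3) and the random posteriors standing in for real recognizer outputs are all hypothetical and not taken from the paper.

```python
import numpy as np


def stack_ppgs(ppg_a, ppg_b):
    """Concatenate two frame-aligned monolingual PPGs along the feature axis.

    Each input has shape (n_frames, n_classes) and must cover the same frames.
    Assumption: "stacking" is read here as feature-axis concatenation.
    """
    if ppg_a.shape[0] != ppg_b.shape[0]:
        raise ValueError("PPGs must have the same number of frames")
    return np.concatenate([ppg_a, ppg_b], axis=1)


def random_posteriors(rng, n_frames, n_classes):
    """Hypothetical stand-in for a recognizer: random per-frame posteriors."""
    logits = rng.normal(size=(n_frames, n_classes))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # each row sums to 1


rng = np.random.default_rng(0)
n_frames = 5
ppg_lang1 = random_posteriors(rng, n_frames, 4)  # toy size: 4 phonetic classes
ppg_lang2 = random_posteriors(rng, n_frames, 3)  # toy size: 3 phonetic classes

bilingual_ppg = stack_ppgs(ppg_lang1, ppg_lang2)
print(bilingual_ppg.shape)
```

Under this reading, every frame carries the posteriors of both recognizers side by side, so the phonetic content of either language is described in one shared feature vector regardless of which language the frame was spoken in.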