Learning Deep and Wide Contextual Representations Using BERT for Statistical Parametric Speech Synthesis

Ya-Jie Zhang, Zhen-Hua Ling
DOI: https://doi.org/10.1145/3458380.3458405
2021-02-26
Abstract: In this paper, we propose a method of learning deep and wide contextual representations for statistical parametric speech synthesis (SPSS) using BERT, a pre-trained language representation model. Traditional acoustic models in SPSS take phoneme sequences and prosody labels as input, and cannot make full use of the deep linguistic representations of the current and surrounding sentences. Therefore, this paper designs two context encoders, i.e., a sentence-window context encoder and a paragraph-level context encoder, to integrate the contextual representations extracted from multiple sentences by BERT into Tacotron2 via an extra attention module. The parameters of BERT are pre-trained and then fine-tuned together with the other components of the model. Experimental results on the Blizzard Challenge 2019 dataset show that both context encoders reduce the errors of acoustic feature prediction and improve the subjective performance of synthetic speech compared with the baseline Tacotron2 model.
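The idea of attending over representations from a window of sentences can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not state the window width, the form of the attention, or how the BERT outputs are pooled. The function names `sentence_window` and `attend` are hypothetical, plain Python lists stand in for BERT vectors, and scaled dot-product attention stands in for the paper's extra attention module.

```python
import math


def sentence_window(sentences, index, width=1):
    """Return the current sentence plus up to `width` neighbours on each side.

    `width` is a hypothetical parameter; the abstract does not give the window size.
    """
    start = max(0, index - width)
    end = min(len(sentences), index + width + 1)
    return sentences[start:end]


def attend(query, context_vectors):
    """Scaled dot-product attention of one query over a list of context vectors.

    Stand-in for the extra attention module: returns the attention weights and
    the weighted sum of the context vectors.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * c for q, c in zip(query, vec)) / scale
        for vec in context_vectors
    ]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(context_vectors[0])
    summary = [
        sum(w * vec[d] for w, vec in zip(weights, context_vectors))
        for d in range(dim)
    ]
    return weights, summary
```

With a four-sentence paragraph, `sentence_window(paragraph, 0)` yields the first two sentences and `sentence_window(paragraph, 2)` yields the last three. In a full system the selected sentences would be encoded by BERT, and the resulting vectors would play the role of `context_vectors` when the decoder state is used as `query`.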