Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction with Multi-Modal Embeddings

Jiangyan Yi, Jianhua Tao, Ruibo Fu, Tao Wang, Chu Yuan Zhang, Chenglong Wang
DOI: https://doi.org/10.1109/taslp.2023.3301235
2023-01-01
Abstract: Prosodic boundaries remain crucial to the naturalness of end-to-end speech synthesis systems. This article proposes adversarial multi-task learning for prosodic boundary prediction, transferring knowledge from an auxiliary part-of-speech (POS) tagging task to the prosodic boundary prediction task. In addition, multi-modal embeddings are composed of contextual word embeddings and speech embeddings obtained from the pre-trained bidirectional encoder representations from transformers (BERT) model and Speech2Vec, respectively. This allows the model to exploit linguistic and acoustic information from large amounts of external text and speech data that carry no prosodic boundary labels. At the inference stage, only the prosodic boundary prediction task is used for decoding, so the model benefits from the syntactic features learnt from the POS tagging task at no extra computational cost. Experiments were conducted on Mandarin datasets. The results show that models using multi-modal embeddings from the pre-trained BERT and Speech2Vec outperform models trained with a single-modal embedding. Furthermore, models trained with adversarial training obtain further gains of up to 2.95% in $F_{1}$ score.
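The two structural ideas in the abstract, joining text and speech embeddings per token and decoding with the boundary task alone, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the two embeddings are already aligned token by token and simply concatenated, it uses a single shared layer with two task heads, and it leaves out training and the adversarial component entirely. All names (`concat_embeddings`, `MultiTaskTagger`), the dimensions, and the label counts are hypothetical, and the weights are random placeholders.

```python
import random


def concat_embeddings(word_vecs, speech_vecs):
    """Join per-token text and speech vectors into one multi-modal vector.

    Assumption: both sequences are aligned token by token.
    """
    if len(word_vecs) != len(speech_vecs):
        raise ValueError("need one speech vector per word vector")
    return [w + s for w, s in zip(word_vecs, speech_vecs)]


def linear(vec, weights):
    """Bias-free linear layer; `weights` holds one row per output unit."""
    return [sum(x * w for x, w in zip(vec, row)) for row in weights]


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


class MultiTaskTagger:
    """Hypothetical shared layer with one head per task (untrained)."""

    def __init__(self, in_dim, hidden_dim, n_boundary, n_pos, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
                    for _ in range(rows)]

        self.shared = init(hidden_dim, in_dim)
        self.boundary_head = init(n_boundary, hidden_dim)
        # The POS head would only be needed while training.
        self.pos_head = init(n_pos, hidden_dim)

    def encode(self, vec):
        """Shared representation with a ReLU non-linearity."""
        return [max(0.0, h) for h in linear(vec, self.shared)]

    def predict_boundaries(self, token_vecs):
        """Decode with the boundary head only; the POS head is never run."""
        return [argmax(linear(self.encode(v), self.boundary_head))
                for v in token_vecs]


# Toy inputs: 3-dimensional "word" vectors and 2-dimensional "speech" vectors.
word_vecs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
speech_vecs = [[1.0, 0.0], [0.0, 1.0]]
features = concat_embeddings(word_vecs, speech_vecs)

model = MultiTaskTagger(in_dim=5, hidden_dim=4, n_boundary=4, n_pos=10)
labels = model.predict_boundaries(features)
```

Because `predict_boundaries` never touches `pos_head`, the auxiliary head could be dropped after training without changing the predictions, which is the sense in which the auxiliary task adds no inference cost in this sketch.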