A Transfer and Multi-Task Learning Based Approach for MOS Prediction

Xiaohai Tian, Kaiqi Fu, Shaojun Gao, Yiwei Gu, Kai Wang, Wei Li, Zejun Ma
DOI: https://doi.org/10.21437/interspeech.2022-10022
2022-01-01
Abstract: Automatic speech quality assessment aims to train a model capable of automatically measuring the performance of synthesis systems. This is a challenging task, especially when the domain of the evaluation data differs from that of the training data. In this paper, we present a multi-task and transfer learning framework for predicting the mean opinion score (MOS) of synthetic speech from different domains. Specifically, the proposed framework consists of a common encoder shared by data from different domains and two domain-specific decoders for in-domain and out-of-domain data, respectively. A wav2vec2 model fine-tuned for phone recognition is used to initialize the shared encoder, making full use of the knowledge it has learned from a large amount of unlabeled data and task-related labeled data. The experiments are conducted on the VoiceMOS Challenge dataset. The results show that the proposed system outperforms the baseline solutions in both in-domain and out-of-domain MOS prediction scenarios. Further, we show that the wav2vec2 encoder fine-tuned for phone recognition can be transferred to boost MOS prediction performance.
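The shared-encoder, two-decoder layout described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the class name `SharedEncoderMOS`, the toy linear encoder standing in for wav2vec2, mean pooling over frames, the single-layer decoders, and the summed mean-squared-error objective in `multitask_loss` are all hypothetical choices that the abstract does not specify.

```python
import random
from dataclasses import dataclass, field


def _linear(n_in, n_out, rng):
    """Return a random weight matrix (n_out rows of n_in weights) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def _apply(layer, x):
    """Apply a linear layer to one feature vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


@dataclass
class SharedEncoderMOS:
    """Toy model: one encoder shared by all domains, one decoder per domain."""
    feat_dim: int = 4
    hidden_dim: int = 8
    seed: int = 0
    encoder: tuple = field(init=False)
    decoders: dict = field(init=False)

    def __post_init__(self):
        rng = random.Random(self.seed)
        # ASSUMPTION: a random linear layer stands in for the wav2vec2 encoder,
        # which in the paper is initialized from a phone-recognition checkpoint.
        self.encoder = _linear(self.feat_dim, self.hidden_dim, rng)
        # Two domain-specific decoders, as described in the abstract.
        self.decoders = {
            "in_domain": _linear(self.hidden_dim, 1, rng),
            "out_of_domain": _linear(self.hidden_dim, 1, rng),
        }

    def encode(self, frames):
        """Encode each frame with ReLU, then mean-pool over time (assumed pooling)."""
        hidden = [[max(0.0, h) for h in _apply(self.encoder, f)] for f in frames]
        return [sum(col) / len(hidden) for col in zip(*hidden)]

    def predict(self, frames, domain):
        """Route the shared utterance embedding to the decoder for `domain`."""
        return _apply(self.decoders[domain], self.encode(frames))[0]


def multitask_loss(model, batches):
    """ASSUMED objective: sum over domains of the mean squared MOS error."""
    total = 0.0
    for domain, examples in batches.items():
        errors = [(model.predict(frames, domain) - mos) ** 2 for frames, mos in examples]
        total += sum(errors) / len(errors)
    return total


if __name__ == "__main__":
    model = SharedEncoderMOS()
    utterance = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]  # two made-up frames
    batches = {
        "in_domain": [(utterance, 3.5)],      # made-up MOS labels
        "out_of_domain": [(utterance, 2.0)],
    }
    print(model.predict(utterance, "in_domain"))
    print(multitask_loss(model, batches))
```

Because both decoders read the same embedding, any update to the encoder driven by one domain's loss also affects predictions for the other domain, which is the mechanism that multi-task training of this kind relies on.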