Abstract: Researchers have recently shown increasing interest in automatically predicting subjective evaluation scores for speech synthesis systems. This prediction is challenging, especially on out-of-domain test sets. In this paper, we propose a novel fusion model for MOS prediction that combines supervised and unsupervised approaches. On the supervised side, we developed an SSL-based predictor called LE-SSL-MOS. LE-SSL-MOS uses pre-trained self-supervised learning models and further improves prediction accuracy by exploiting the per-listener opinion scores of each utterance in a listener enhancement branch. The unsupervised side has two components. First, we fine-tuned the unit language model (ULM) on highly intelligible in-domain data to improve the correlation of an unsupervised metric, SpeechLMScore. Second, we used ASR confidence as a new metric with the help of ensemble learning. To our knowledge, this is the first architecture that fuses supervised and unsupervised methods for MOS prediction. Our experimental results on the VoiceMOS Challenge 2023 show that LE-SSL-MOS performs better than the baseline. Our fusion system achieved an absolute improvement of 13% over LE-SSL-MOS on the noisy and enhanced speech track. Our system ranked 1st in the challenge's French speech synthesis track and 2nd in its noisy and enhanced speech track.
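
To make the idea of fusing supervised and unsupervised scores concrete, the sketch below combines one supervised prediction per utterance with any number of unsupervised metrics. It is only an illustration under stated assumptions: the abstract does not specify the fusion rule, so the z-score normalization, the weighted average, the function names `zscore` and `fuse_scores`, and all example numbers are hypothetical and not the authors' method.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a list of scores to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant metric carries no ranking information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def fuse_scores(supervised, unsupervised_metrics, weights):
    """Weighted average of z-scored metrics, one fused score per utterance.

    Assumption: simple linear fusion. weights[0] applies to the supervised
    predictor, weights[1:] to the unsupervised metrics in order.
    """
    columns = [supervised] + list(unsupervised_metrics)
    if len(weights) != len(columns):
        raise ValueError("need exactly one weight per metric")
    normalized = [zscore(column) for column in columns]
    total = sum(weights)
    return [
        sum(w * column[i] for w, column in zip(weights, normalized)) / total
        for i in range(len(supervised))
    ]


# Hypothetical scores for three utterances (illustrative values only).
supervised_mos = [3.0, 4.0, 2.0]      # e.g. output of a supervised predictor
speech_lm_score = [-2.5, -1.0, -4.0]  # e.g. a log-likelihood style metric
asr_confidence = [0.80, 0.95, 0.60]   # e.g. mean ASR confidence

fused = fuse_scores(
    supervised_mos,
    [speech_lm_score, asr_confidence],
    weights=[0.5, 0.25, 0.25],
)
```

Normalizing each metric first keeps a metric with a large numeric range from dominating the average. The fused values are relative scores suited to ranking utterances or systems, not values on the MOS scale.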
