Abstract: Recently, researchers have shown increasing interest in automatically predicting subjective evaluation scores for speech synthesis systems. This prediction is challenging, especially on out-of-domain test sets. In this paper, we propose a novel fusion model for MOS prediction that combines supervised and unsupervised approaches. On the supervised side, we develop an SSL-based predictor called LE-SSL-MOS. LE-SSL-MOS builds on pre-trained self-supervised learning models and further improves prediction accuracy by using the individual opinion scores of each utterance in a listener enhancement branch. The unsupervised side has two parts. First, we fine-tune the unit language model (ULM) on highly intelligible in-domain data to improve the correlation of the unsupervised metric SpeechLMScore. Second, we use ASR confidence as a new metric, combined with the help of ensemble learning. To our knowledge, this is the first architecture that fuses supervised and unsupervised methods for MOS prediction. Our experimental results on the VoiceMOS Challenge 2023 show that LE-SSL-MOS performs better than the baseline, and our fusion system achieves an absolute improvement of 13% over LE-SSL-MOS on the noisy and enhanced speech track. Our system ranked 1st in the French speech synthesis track and 2nd in the noisy and enhanced speech track of the challenge.
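The fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that SpeechLMScore is the average per-token log-probability of a discrete unit sequence under the unit language model, and it substitutes a simple weighted average of min-max normalized scores for the ensemble learning step, whose details the abstract does not give. The function names and the weights are hypothetical.

```python
from typing import List, Sequence


def speech_lm_score(token_log_probs: Sequence[float]) -> float:
    """Average per-token log-probability of one utterance's unit sequence.

    The log-probabilities are assumed to come from a unit language model;
    producing them is outside the scope of this sketch.
    """
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    return sum(token_log_probs) / len(token_log_probs)


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so metrics on different scales can be mixed."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All scores are identical; return a neutral midpoint for each.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_scores(
    supervised_mos: Sequence[float],
    lm_scores: Sequence[float],
    asr_confidences: Sequence[float],
    weights: Sequence[float] = (0.6, 0.2, 0.2),  # hypothetical weights
) -> List[float]:
    """Weighted average of three normalized per-utterance metrics.

    This is an assumed stand-in for the paper's ensemble step, chosen
    only because it is the simplest way to combine the three signals.
    """
    if not (len(supervised_mos) == len(lm_scores) == len(asr_confidences)):
        raise ValueError("all metrics must cover the same utterances")
    columns = [
        min_max_normalize(supervised_mos),
        min_max_normalize(lm_scores),
        min_max_normalize(asr_confidences),
    ]
    total = sum(weights)
    return [
        sum(w * col[i] for w, col in zip(weights, columns)) / total
        for i in range(len(supervised_mos))
    ]
```

The fused value is a relative quality score in [0, 1], suitable for ranking utterances or systems. Mapping it back to the 1 to 5 MOS scale would need a calibration step that is not shown here.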

An Analysis of Linear Complexity Attention Substitutes with BEST-RQ

Linear-Complexity Self-Supervised Learning for Speech Processing

SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

Open Implementation and Study of BEST-RQ for Speech Processing

Investigating Self-Supervised Learning for Speech Enhancement and Separation

Audio-visual fine-tuning of audio-only ASR models

Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations

LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement

Efficient infusion of self-supervised representations in Automatic Speech Recognition

Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

Self-supervised Speech Representations Still Struggle with African American Vernacular English

An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis

CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification

Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models

An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions

More Speaking or More Speakers?

AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup

Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy Reduction

MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models

Audio self-supervised learning: A survey