Abstract: Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS's pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems that can exceed human speech quality. More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 471 human listeners across Hindi and Tamil, we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. The first variant enables fairer ratings for synthesized samples that surpass human reference quality. The second variant reduces ambiguity, as indicated by the relatively lower variance across raters. By combining these approaches, we achieve both more reliable and more fine-grained assessments. We also release MANGO, a massive dataset of 47,100 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems.

Objective Evaluation Methods for Chinese Text-To-Speech Systems

TTSDS -- Text-to-Speech Distribution Score

Automatic Prosody Quality Evaluation of Mandarin Speech

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Total Quality Evaluation of Speech Synthesis Systems

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model

Prosody Variation: Application to Automatic Prosody Evaluation of Mandarin Speech

Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech

Why We Should Report the Details in Subjective Evaluation of TTS More Rigorously

Design of English text-to-speech conversion algorithm based on machine learning

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Learning Prosodic Patterns for Mandarin Speech Synthesis

Multilingual Speech Evaluation: Case Studies on English, Malay and Tamil

Improved unit selection speech synthesis method utilizing subjective evaluation results on synthetic speech

Chinese Prosody Generation Based on C-ToBI Representation for Text-To-Speech

A Comparative Analysis of Pretrained Language Models for Text-to-Speech

Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis