Abstract: Speech synthesis has come a long way: current text-to-speech (TTS) models can now generate natural, human-sounding speech. However, most TTS research relies on adult speech data, and very little work has addressed child speech synthesis. This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models on child speech datasets. The approach adopts a multi-speaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to yield a smaller subset of approximately 19 hours, which formed the basis of the fine-tuning experiments. Both objective and subjective evaluations were performed, using a pretrained MOSNet for the former and a novel subjective framework for mean opinion score (MOS) evaluations for the latter. Subjective evaluations yielded MOS values of 3.95 for speech intelligibility, 3.89 for voice naturalness, and 3.96 for voice consistency. Objective evaluation with the pretrained MOSNet showed a strong correlation between real and synthetic child voices. Speaker similarity was also verified by calculating the cosine similarity between the embeddings of utterances. An automatic speech recognition (ASR) model was also used to provide a word error rate (WER) comparison between the real and synthetic child voices. The final trained TTS model was able to synthesize child-like speech from reference audio samples as short as 5 seconds.
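
The two comparison metrics named in the abstract, cosine similarity between utterance embeddings and WER between transcripts, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names are hypothetical, and a real pipeline would presumably obtain the embeddings from a speaker encoder and the hypothesis transcripts from an ASR model, neither of which is specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: whitespace tokenization with no text normalization;
    real evaluations usually lowercase and strip punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the word-level edit distance between the reference
    # prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Under these assumptions, a cosine similarity close to 1 between a real and a synthetic utterance embedding would suggest a similar speaker identity, and similar WER values for real and synthetic speech would suggest comparable intelligibility to the ASR model.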

Text-to-Speech Pipeline for Swiss German -- A comparison

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions

Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model

Spaiche: Extending State-of-the-Art ASR Models to Swiss German Dialects

SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German

On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition

Dialect Transfer for Swiss German Speech Translation

Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021

A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis

SDS-200: A Swiss German Speech to Standard German Text Corpus

TTSDS -- Text-to-Speech Distribution Score

U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

Objective Evaluation Methods for Chinese Text-To-Speech Systems

SVTS: Scalable Video-to-Speech Synthesis

2nd Swiss German Speech to Standard German Text Shared Task at SwissText 2022

A Swiss German Dictionary: Variation in Speech and Writing

A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography

Does Whisper understand Swiss German? An automatic, qualitative, and human evaluation

SwissBERT: The Multilingual Language Model for Switzerland