Abstract: This study introduces a refined approach to Text-to-Speech (TTS) generation that significantly enhances sampling stability across languages, with a particular focus on Hebrew. By leveraging discrete semantic units with higher phonetic correlation obtained from a self-supervised model, our method addresses the inherent instability often encountered in TTS systems, especially those dealing with non-diacriticized scripts like Hebrew. Utilizing HuBERT codes, our model generates discrete representations that are optimized for TTS tasks, thereby reducing the dependency on diacritic-based text processing. This advancement not only simplifies the language modeling process but also improves the robustness and controllability of the speech output, owing to the disentanglement properties of the semantic units. The inclusion of a speaker embedding in the vocoder further aids in capturing the unique vocal characteristics of the speaker, contributing to the naturalness of the synthesized speech. Our experimental results demonstrate that this approach not only maintains high performance in Hebrew but also shows adaptability to English, underscoring its effectiveness in enhancing stability in TTS systems universally. Our method, named LOTHM (Language of The Hebrew Man), outperforms existing methods in terms of stability while achieving naturalness and speaker similarity on par with previous methods, making it a compelling choice for future speech synthesis applications. Samples can be found on our page: http://pages.cs.huji.ac.il/adiyoss-lab/LoTHM

What problem does this paper attempt to address?

The main problem this paper addresses is the stability of Text-to-Speech (TTS) systems, especially for languages that are usually written without vowel signs (non-diacriticized scripts), such as Hebrew. Specifically, the paper proposes a method that enhances the sampling stability of TTS systems by using discrete semantic units (such as HuBERT codes).

### Detailed description of the main problem

1. **Instability of TTS systems for non-diacriticized scripts**:
   - Languages like Hebrew are usually written without vowel signs (diacritics), which makes it difficult for TTS systems to generate accurate, natural speech.
   - Traditional TTS systems usually add vowel signs to the text to improve pronunciation accuracy, but this increases complexity and processing difficulty.
2. **Limitations of existing methods**:
   - Existing TTS methods for non-diacriticized scripts rely on complex multi-stage models and are prone to error propagation.
   - These methods usually require sampling multiple times to obtain a satisfactory result, which is the sampling instability the paper targets.
3. **Improving stability and controllability**:
   - The proposed method aims to simplify the language modeling process and reduce the dependence on vowel signs by using discrete semantic units (such as HuBERT codes).
   - This improves the stability of the TTS system and also enhances the controllability of the generated speech, for example by separating out speaker identity and speech rate.

### Overview of the solution

- **Using discrete semantic units**: Generate discrete representations with higher phonetic correlation through self-supervised models (such as HuBERT), thereby simplifying language modeling and improving stability.
- **Integrating tasks**: Integrate semantic unit extraction and speech generation into a single language model to reduce error propagation in cascaded models.
- **Adding speaker embeddings**: Add a speaker embedding in the vocoder to capture unique voice characteristics and further improve the naturalness of the synthesized speech.

### Experimental results

The experimental results show that the method performs well on Hebrew and also adapts to English, supporting its effectiveness in improving the stability of TTS systems.

### Formula examples

Some of the formulas involved in the paper:

- **Discrete sequence mapping**:
  \[ z_{\text{ac}} \in \{N_{\text{ac}}\}^{f_r \cdot d_p} \]
  where \( N_{\text{ac}} \) is the vocabulary size, \( f_r \) is the time-axis frame rate of the tokenizer, and \( d_p \) is the duration of the audio cue.
- **Cross-entropy loss**:
  \[ \mathcal{L}_{\text{CE}} = -\sum_{i = 1}^{T} y_i \log(\hat{y}_i) \]
  where \( y_i \) is the true label and \( \hat{y}_i \) is the predicted probability.

Through these improvements, the proposed method improves the stability and controllability of TTS systems, especially for languages with non-diacriticized scripts.
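
The two formulas can be checked numerically with a small sketch. It is an illustration only, not the paper's implementation: the function names, the 50 Hz frame rate, the 3-second duration, and the toy probabilities are all assumptions made here, and the labels \( y_i \) are assumed to be one-hot, so only the predicted probability of the true token contributes at each step.

```python
import math


def num_acoustic_tokens(frame_rate_hz: float, duration_s: float) -> int:
    """Length f_r * d_p of the discrete sequence z_ac for one audio clip."""
    return int(round(frame_rate_hz * duration_s))


def cross_entropy(true_ids, predicted_probs):
    """L_CE = -sum_i y_i * log(y_hat_i), assuming one-hot labels y_i.

    true_ids[i] is the index of the correct token at step i, and
    predicted_probs[i] is the model's distribution over the vocabulary.
    """
    eps = 1e-12  # guards against log(0)
    return -sum(
        math.log(max(probs[token_id], eps))
        for token_id, probs in zip(true_ids, predicted_probs)
    )


# Hypothetical values: a tokenizer running at 50 frames per second
# applied to a 3-second clip yields 150 tokens.
n_tokens = num_acoustic_tokens(frame_rate_hz=50, duration_s=3.0)

# Toy example with T = 2 steps and a vocabulary of 3 tokens.
true_ids = [0, 2]
predicted_probs = [
    [0.5, 0.25, 0.25],
    [0.25, 0.25, 0.5],
]
loss = cross_entropy(true_ids, predicted_probs)  # equals 2 * ln(2)

print(n_tokens, round(loss, 4))
```

Each true token receives probability 0.5 in the toy example, so the loss is \( 2 \ln 2 \); a model that assigned probability 1 to every true token would reach a loss of 0.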

Enhancing TTS Stability in Hebrew using Discrete Semantic Units
