Abstract: Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they suffer from low pronunciation accuracy, inconsistent speaking style and timbre, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach, complemented by a tailored data augmentation strategy, and train it on a combination of real and synthetic data, scaling the data size up to 650k hours and yielding a zero-shot TTS model with 0.8B parameters. Specifically, our method uses a predictor to incorporate into the TTS model a latent variable sequence that carries supplementary acoustic information based on refined self-supervised learning (SSL) discrete units. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is used to generate many voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity while also ensuring consistency in timbre. Comparative experiments (demo page: https://anonymous.4open.science/w/ham-tts/) demonstrate our model's superiority over VALL-E in pronunciation accuracy, speaking-style consistency, and timbre continuity.
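As a rough illustration of the "replace and duplicate segments" augmentation mentioned in the abstract, the following is a minimal sketch only. The abstract does not say how segments are selected, how long they are, or whether the operation acts on waveforms, frames, or tokens, so the function name `augment_segments`, its parameters, and the choice to work on a flat sequence of frame values are all assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import List, Sequence


def augment_segments(
    frames: Sequence[int],
    other: Sequence[int],
    seg_len: int,
    mode: str,
    rng: random.Random,
) -> List[int]:
    """Return a copy of `frames` with one segment replaced or duplicated.

    Hypothetical helper: `frames` and `other` stand in for two training
    samples (e.g. acoustic frames or tokens); the real selection policy
    is not given in the abstract.
    """
    if seg_len <= 0 or seg_len > len(frames):
        raise ValueError("seg_len must be in 1..len(frames)")

    # Pick a random segment position inside the sample being augmented.
    start = rng.randrange(0, len(frames) - seg_len + 1)
    out = list(frames)

    if mode == "replace":
        if seg_len > len(other):
            raise ValueError("seg_len must not exceed len(other)")
        # Overwrite the chosen segment with an equally long segment
        # taken from another sample (assumed: a different utterance).
        src = rng.randrange(0, len(other) - seg_len + 1)
        out[start:start + seg_len] = other[src:src + seg_len]
    elif mode == "duplicate":
        # Insert a copy of the chosen segment directly after itself.
        segment = out[start:start + seg_len]
        out[start + seg_len:start + seg_len] = segment
    else:
        raise ValueError("mode must be 'replace' or 'duplicate'")
    return out
```

In this sketch, "replace" keeps the sequence length unchanged and "duplicate" lengthens it by `seg_len`; passing a seeded `random.Random` instance makes the augmentation reproducible.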

HMM based speech synthesis with Global Variance Training method

Global Variance Modeling on Frequency Domain Delta LSP for HMM-based Speech Synthesis

Improving HMM Based Speech Synthesis by Reducing Over-Smoothing Problems

Global Variance Modeling on the Log Power Spectrum of LSPs for HMM-based Speech Synthesis

A state duration generation algorithm considering global variance for HMM-based speech synthesis

Statistical Modification Based Post-Filtering Technique for HMM-based Speech Synthesis

A Novel HTS System Using both Continuous HMMs and Discrete HMMs

Amplitude Spectrum Based Excitation Model for HMM-Based Speech Synthesis

A Novel HMM-Based TTS System Using Both Continuous HMMs and Discrete HMMs

Speech Synthesis Based on Gaussian Conditional Random Fields

Pitch-Scaled Spectrum Based Excitation Model for HMM-based Speech Synthesis

Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems

Formant-Controlled HMM-Based Speech Synthesis

A Hierarchical Viterbi Algorithm For Mandarin Hybrid Speech Synthesis System

Inverse Filtering Based Harmonic Plus Noise Excitation Model for HMM-Based Speech Synthesis

Improving F0 prediction using bidirectional associative memories and syllable-level F0 features for HMM-based Mandarin speech synthesis

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

Hierarchical Generative Modeling for Controllable Speech Synthesis

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Voiced/unvoiced Decision Algorithm for HMM-based Speech Synthesis

Global variance equalization for improving deep neural network based speech enhancement