Abstract: Current text-to-speech (TTS) models can produce natural speech but often fail to synthesize long-form speech properly when only a sentence-level corpus is available. The failure stems mainly from 1) poor length generalization of the acoustic model, 2) a lack of appropriate pause marks in the inference text, and 3) the absence of contextual information during training. To address 1) and 2), we propose Content Extrapolation, which introduces Moving Average Equipped Gated Attention (MEGA) to improve the acoustic model's length generalization and presents the Global-information-enhanced Classification Pause Insertion model (GCPI) to insert pause marks. For 3), we propose LLM-based Contextual Enrichment (LLM-CE) to generate multiple sets of different contexts. Experiments show that the proposed methods solve the above issues and successfully generate long-form speech with clear pronunciation and natural prosody using only a sentence-level corpus, reducing training costs.

Synthesizing Long-Form Speech Merely from Sentence-Level Corpus with Content Extrapolation and LLM Contextual Enrichment