Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg
2024-06-26
Abstract: Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves robustness of LLM-based TTS models.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the lack of robustness in text-to-speech (TTS) systems based on large language models (LLMs). When the input text contains repeated words or complex structures, these systems are prone to repeating words, omitting words, and misaligning speech with text (errors referred to as hallucinations or attention errors). These problems make LLM-based TTS models unreliable in practical applications.

Specifically, the authors point out that when an encoder-decoder Transformer is trained to predict speech tokens from text, some cross-attention heads implicitly learn the alignment between text and speech. Because this alignment is unconstrained during training, misaligned synthesis can occur at inference time. To improve robustness, the authors propose using a CTC loss and attention priors to encourage monotonic cross-attention over the text tokens. The method introduces no new learnable parameters and significantly improves the robustness of LLM-based TTS models.

### Key Problem Summary

1. **Hallucination and Attention Errors**:
   - LLM-based TTS models may hallucinate (for example, repeat or omit words) when generating speech, especially when the text contains repeated words.
2. **Alignment Problems**:
   - Some cross-attention heads in the encoder-decoder Transformer implicitly learn the alignment between text and speech, but because the alignment is unconstrained, it may be unstable during inference.
3. **Improving Robustness**:
   - The authors propose using a CTC loss and attention priors to encourage monotonic cross-attention over the text tokens, thereby improving the robustness of the model.

### Solutions

- **CTC Loss**: Compute a Connectionist Temporal Classification (CTC) loss to encourage monotonic alignment.
- **Attention Priors**: Use a static 2D beta-binomial prior matrix to bias the attention mechanism toward monotonic alignment.

Together, these methods make the TTS model more robust on complex texts and reduce the occurrence of hallucinations and attention errors.
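
To make the idea of a static 2D beta-binomial prior more concrete, the following is a minimal sketch rather than the authors' implementation. The function names, the `scaling` parameter, and the parametrization `a = scaling * t`, `b = scaling * (T + 1 - t)` are assumptions borrowed from a formulation commonly used in earlier TTS alignment work; this summary does not state the exact form used in the paper. Each row is a distribution over text positions for one speech step, and the peak of that distribution moves from the first text token to the last as the speech step advances, which gives a near-diagonal matrix.

```python
import math


def log_beta(x, y):
    """Log of the Beta function B(x, y), computed via log-gamma."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_binomial_pmf(k, n, a, b):
    """P(K = k) for a beta-binomial distribution with n trials and shape (a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def beta_binomial_prior(num_text_tokens, num_speech_steps, scaling=1.0):
    """Build a [num_speech_steps x num_text_tokens] prior matrix (hypothetical helper).

    Row t is a beta-binomial distribution over text positions 0..N-1.
    The shape parameters below are an assumed parametrization, not taken
    from the paper: early speech steps favour early text tokens and late
    speech steps favour late ones.
    """
    n = num_text_tokens - 1  # support is 0..n, so each row sums to 1
    prior = []
    for t in range(1, num_speech_steps + 1):
        a = scaling * t
        b = scaling * (num_speech_steps + 1 - t)
        prior.append([beta_binomial_pmf(k, n, a, b) for k in range(num_text_tokens)])
    return prior


# Small example: 5 text tokens, 20 speech steps.
prior = beta_binomial_prior(num_text_tokens=5, num_speech_steps=20)

# Index of the most likely text token at each speech step.
peaks = [max(range(len(row)), key=row.__getitem__) for row in prior]
print(peaks)
```

One common way to use such a matrix is to add its logarithm to the cross-attention logits during the early part of training, so that attention starts out roughly diagonal; whether that matches the paper's exact procedure is not stated in this summary. A larger `scaling` value concentrates each row more tightly around the diagonal.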