Abstract: Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and misaligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves robustness of LLM-based TTS models.

What problem does this paper attempt to address?

This paper addresses the lack of robustness in text-to-speech (TTS) systems built on large language models (LLMs). When the input text contains repeated words or complex structures, these models are prone to repeating words, omitting words, and producing misaligned speech (errors referred to as hallucinations or attention errors). These failures make LLM-based TTS models unreliable in practical applications.

Specifically, the authors observe that when an encoder-decoder Transformer is trained to predict speech tokens from text, some of its cross-attention heads implicitly learn the alignment between text and speech. Because this alignment is unconstrained during training, misaligned synthesis can occur at inference time. To improve robustness, the authors propose using a CTC loss and attention priors to encourage monotonic cross-attention over the text tokens. The method introduces no new learnable parameters and significantly improves the robustness of LLM-based TTS models.

### Key Problem Summary

1. **Hallucination and Attention Errors**: LLM-based TTS models may hallucinate when generating speech (for example, repeating or omitting words), especially when the text contains repeated words.
2. **Alignment Problems**: Some cross-attention heads in the encoder-decoder Transformer implicitly learn the alignment between text and speech, but because nothing constrains this alignment, it may be unstable during inference.
3. **Improving Robustness**: The authors propose using a CTC loss and attention priors to encourage monotonic cross-attention over the text tokens, thereby improving the robustness of the model.

### Solutions

- **CTC Loss**: Compute a Connectionist Temporal Classification (CTC) loss to encourage an effective monotonic alignment.
- **Attention Priors**: Use a static 2D beta-binomial prior matrix to bias the attention mechanism toward monotonic alignment.

Together, these methods make the TTS model more robust on complex texts and reduce the occurrence of hallucinations and attention errors.
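The two ideas listed under Solutions can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The function names (`beta_binomial_prior`, `monotonic_alignment_loss`), the `scaling` and `eps` parameters, and the parameterization `a = scaling * t`, `b = scaling * (T + 1 - t)` are assumptions made for illustration. The loss is a simplified, blank-free forward sum over monotonic paths rather than a full CTC formulation.

```python
import math


def _log_beta(x, y):
    # log of the Beta function via log-gamma
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def _beta_binomial_pmf(k, n, a, b):
    # P(K = k) for a beta-binomial with n trials and shape parameters a, b
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + _log_beta(k + a, n - k + b) - _log_beta(a, b))


def beta_binomial_prior(num_text_tokens, num_speech_steps, scaling=1.0):
    """Static 2D prior: one distribution over text tokens per speech step.

    Assumed parameterization: the mass drifts from the first token to the
    last token as the speech step t increases, giving a near-diagonal band.
    """
    n = num_text_tokens - 1
    prior = []
    for t in range(1, num_speech_steps + 1):
        a = scaling * t
        b = scaling * (num_speech_steps + 1 - t)
        prior.append([_beta_binomial_pmf(k, n, a, b) for k in range(num_text_tokens)])
    return prior


def monotonic_alignment_loss(attn_probs, eps=1e-8):
    """Negative log of the total probability of all monotonic paths.

    attn_probs[t][k] is the attention weight of speech step t on text token k.
    A path starts at token 0, ends at the last token, and at every step either
    stays on the current token or advances by exactly one (simplified: no blank).
    """
    num_tokens = len(attn_probs[0])
    neg_inf = float("-inf")
    # alpha[k]: log-probability of all valid paths that end on token k so far
    alpha = [neg_inf] * num_tokens
    alpha[0] = math.log(attn_probs[0][0] + eps)
    for row in attn_probs[1:]:
        new_alpha = []
        for k in range(num_tokens):
            stay = alpha[k]
            advance = alpha[k - 1] if k > 0 else neg_inf
            m = max(stay, advance)
            if m == neg_inf:
                new_alpha.append(neg_inf)  # token k not reachable yet
            else:
                log_sum = m + math.log(math.exp(stay - m) + math.exp(advance - m))
                new_alpha.append(log_sum + math.log(row[k] + eps))
        alpha = new_alpha
    return -alpha[-1]
```

As a usage example under the same assumptions, `beta_binomial_prior(5, 12)` returns 12 rows of 5 probabilities each, with the peak moving from the first token to the last as the speech step increases. A cross-attention matrix that follows the diagonal yields a much lower `monotonic_alignment_loss` than one that runs against it, which is the training signal that pushes attention toward monotonic alignment.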

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments

One TTS Alignment To Rule Them All

Enhancing Monotonicity for Robust Autoregressive Transformer TTS

Improving the Robustness of Large Language Models via Consistency Alignment

Boosting Large Language Model for Speech Synthesis: An Empirical Study

VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis

Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

Improving Joint Speech-Text Representations Without Alignment

Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention

Improving Autoregressive NLP Tasks via Modular Linearized Attention

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets

Weak Alignment Supervision from Hybrid Model Improves End-to-end ASR

AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM

MoBoAligner: A Neural Alignment Model for Non-Autoregressive TTS with Monotonic Boundary Search

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition