Abstract: Existing zero-shot text-to-speech (TTS) systems are typically designed to process complete sentences and are constrained by the maximum duration for which they have been trained. However, in many streaming applications, texts arrive continuously in short chunks, necessitating instant responses from the system. We identify the essential capabilities required for chunk-level streaming and introduce LiveSpeech 2, a stream-aware model that supports infinitely long speech generation, text-audio stream synchronization, and seamless transitions between short speech chunks. To achieve these, we propose (1) adopting Mamba, a class of sequence modeling distinguished by linear-time decoding, which is augmented by cross-attention mechanisms for conditioning, (2) utilizing rotary positional embeddings in the computation of cross-attention, enabling the model to process an infinite text stream by sliding a window, and (3) decoding with semantic guidance, a technique that aligns speech with the transcript during inference with minimal overhead. Experimental results demonstrate that our models are competitive with state-of-the-art language model-based zero-shot TTS models, while also providing flexibility to support a wide range of streaming scenarios.

What problem does this paper attempt to address?

The paper addresses the challenges that existing zero-shot text-to-speech (TTS) systems face when handling continuous text streams. Such systems are usually designed to process complete sentences and are limited by the maximum duration seen during training. In many streaming applications, however, text arrives continuously in short chunks, and the system must respond immediately.

### Main problems

1. **Infinitely long speech generation**: Existing systems cannot keep generating speech for an unbounded text stream.
2. **Text-audio stream synchronization**: The generated speech must stay synchronized with the incoming text stream.
3. **Seamless transitions between short speech chunks**: The speech generated for consecutive short text chunks must join smoothly, without abrupt boundaries.

### Solutions

To solve these problems, the authors introduce LiveSpeech 2, a stream-aware model with the following characteristics:

1. **Adopting the Mamba architecture**: Mamba is a class of sequence models with linear-time decoding; it is augmented with cross-attention mechanisms for conditioning.
2. **Using rotary positional embeddings**: Rotary positional embeddings are applied in the cross-attention computation, enabling the model to process an infinite text stream by sliding a window.
3. **Semantic-guided decoding**: Semantic guidance aligns the speech with the transcript during inference, with minimal overhead.

### Experimental results

The experiments show that LiveSpeech 2 is competitive with state-of-the-art language-model-based zero-shot TTS models while also offering the flexibility to support a wide range of streaming scenarios.
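The sliding-window claim in the second solution rests on a general property of rotary positional embeddings: the score between a rotated query and a rotated key depends only on their relative offset, not on their absolute positions. The sketch below illustrates that property in plain Python. The function names `rope` and `score`, the base of 10000, and the example vectors are illustrative assumptions, not details taken from the paper, and how LiveSpeech 2 actually assigns positions inside its window is not specified here.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`.

    Hypothetical helper: standard RoPE pairing with an assumed base of 10000.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one rotation frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, q_pos, k_pos):
    """Dot product between a rotated query and a rotated key."""
    rq, rk = rope(q, q_pos), rope(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Arbitrary example vectors (assumed, dimension 4).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3 positions), very different absolute positions.
early = score(q, k, q_pos=5, k_pos=2)
late = score(q, k, q_pos=100005, k_pos=100002)

# The two scores agree up to floating-point error.
print(abs(early - late) < 1e-6)
```

Because only the offset matters, positions can in principle be re-indexed relative to a moving window without changing the attention scores, which is one plausible reading of why the window can slide indefinitely.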
### Formula representation

The key formulas from the paper are given below.

#### Cross-attention calculation

\[ a_t=\text{Softmax}\left(\frac{\left[\,q_tK_{\text{enr}}^T\;;\;\text{RoPE}(q_t,t)\,\text{RoPE}(K_{\text{txt}},T_t)^T\,\right]}{\sqrt{d_k}}\right)\begin{bmatrix}V_{\text{enr}}\\V_{\text{txt}}\end{bmatrix} \]

where $\text{RoPE}(K,T)$ denotes rotating the key matrix $K$ according to the indices $T$, and $[\,\cdot\;;\;\cdot\,]$ denotes concatenation of the scores over the enrollment keys $K_{\text{enr}}$ and the text keys $K_{\text{txt}}$.

#### Semantic-guided decoding

During inference, the guiding token set $T_{\text{guiding}}$ is determined from the current grapheme token sequence and the transcript, and $T_{\text{top-k}}$ is the set of the $k$ tokens with the highest probability. The next grapheme token is sampled from $T_{\text{guiding}}\cup T_{\text{top-k}}$ after the probabilities of the guiding tokens are scaled by $(1+\lambda)$ and the distribution $\tilde{p}$ is renormalized:

\[ \tilde{p}[T_{\text{guiding}}]\leftarrow\tilde{p}[T_{\text{guiding}}]\times(1+\lambda) \]

\[ g_t\sim\text{TopKSampling}(\tilde{p},k) \]

With these components, LiveSpeech 2 can handle streaming text-to-speech tasks reliably and with low latency.
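The re-weighting step in semantic-guided decoding can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `guided_sample`, the dictionary representation of $\tilde{p}$, and the example numbers are hypothetical, and the guiding set is simply passed in because the summary does not say how it is derived from the transcript.

```python
import random

def guided_sample(probs, guiding, k, lam, rng=random):
    """Sample a token id with guiding tokens boosted by (1 + lam).

    probs   -- list of model probabilities, indexed by token id
    guiding -- set of token ids in the guiding set (assumed to be given)
    k       -- top-k size
    lam     -- boost strength (lambda in the formulas)
    """
    # T_top-k: the k most probable token ids under the model.
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Restrict to the union of the guiding set and the top-k set.
    candidates = set(top_k) | set(guiding)
    # Boost guiding tokens, leave the others unchanged.
    weights = {i: probs[i] * ((1 + lam) if i in guiding else 1.0)
               for i in candidates}
    # Renormalize to obtain p-tilde.
    total = sum(weights.values())
    p_tilde = {i: w / total for i, w in weights.items()}
    # TopKSampling(p-tilde, k): keep the k likeliest candidates, then sample.
    kept = sorted(p_tilde, key=p_tilde.get, reverse=True)[:k]
    token = rng.choices(kept, weights=[p_tilde[i] for i in kept])[0]
    return token, p_tilde

# Hypothetical example: token 3 is unlikely under the model but is guided.
rng = random.Random(0)
token, p_tilde = guided_sample([0.5, 0.3, 0.15, 0.05], guiding={3},
                               k=2, lam=11.0, rng=rng)
print(token, p_tilde)
```

In this example the guided token starts outside the model's top 2 but, after the boost and renormalization, becomes the most probable candidate, so the final top-k sampling can select it. With a small $\lambda$ the same token could still fall outside the final top $k$, which is why the choice of $\lambda$ matters in this reading of the formulas.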

Zero-Shot Text-to-Speech from Continuous Text Streams
