Zero-Shot Text-to-Speech from Continuous Text Streams

Trung Dang, David Aponte, Dung Tran, Tianyi Chen, Kazuhito Koishida
2024-10-01
Abstract: Existing zero-shot text-to-speech (TTS) systems are typically designed to process complete sentences and are constrained by the maximum duration for which they have been trained. However, in many streaming applications, texts arrive continuously in short chunks, necessitating instant responses from the system. We identify the essential capabilities required for chunk-level streaming and introduce LiveSpeech 2, a stream-aware model that supports infinitely long speech generation, text-audio stream synchronization, and seamless transitions between short speech chunks. To achieve these, we propose (1) adopting Mamba, a class of sequence modeling distinguished by linear-time decoding, which is augmented by cross-attention mechanisms for conditioning, (2) utilizing rotary positional embeddings in the computation of cross-attention, enabling the model to process an infinite text stream by sliding a window, and (3) decoding with semantic guidance, a technique that aligns speech with the transcript during inference with minimal overhead. Experimental results demonstrate that our models are competitive with state-of-the-art language model-based zero-shot TTS models, while also providing flexibility to support a wide range of streaming scenarios.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the difficulty that existing zero-shot text-to-speech (TTS) systems have with continuous text streams. Such systems are usually designed to process complete sentences and are limited by the maximum duration they were trained on. In many streaming applications, however, text arrives continuously in short chunks, and the system must respond immediately.

### Main problems

1. **Infinitely long speech generation**: Existing systems have difficulty handling text streams of unbounded length.
2. **Text-audio stream synchronization**: The generated speech must stay synchronized with the incoming text stream.
3. **Seamless transitions between short speech chunks**: The speech generated for consecutive short text chunks must join smoothly, without abrupt boundaries.

### Solutions

To solve these problems, the authors introduce LiveSpeech 2, a stream-aware model with the following characteristics:

1. **Adopting the Mamba architecture**: Mamba is a class of sequence models with linear-time decoding; here it is augmented with cross-attention mechanisms for conditioning.
2. **Using rotary positional embeddings**: Rotary positional embeddings are applied when computing cross-attention, which lets the model process an infinite text stream by sliding a window over it.
3. **Semantic-guided decoding**: During inference, semantic guidance aligns the generated speech with the transcript at minimal additional overhead.

### Experimental results

The experiments show that LiveSpeech 2 is competitive with state-of-the-art language-model-based zero-shot TTS models while also supporting a wider range of streaming scenarios.
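The sliding-window point in the Solutions list rests on a general property of rotary positional embeddings: the attention score between a rotated query and a rotated key depends only on the offset between their positions, not on their absolute values. The following is a minimal NumPy sketch of that property, not the paper's implementation. The function names `rope` and `window_scores`, the frequency base of 10000, and the toy dimensions are all illustrative assumptions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (n, d) by angles proportional to positions (n,).

    Assumption: the common "split halves" pairing and a base of 10000.
    """
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)                        # (half,)
    angles = np.asarray(positions, dtype=np.float64)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def window_scores(q, q_pos, keys, key_pos):
    """Scaled, unnormalized scores of one query against a window of text keys."""
    q_rot = rope(q[None, :], [q_pos])
    k_rot = rope(keys, key_pos)
    return (q_rot @ k_rot.T)[0] / np.sqrt(q.shape[0])

rng = np.random.default_rng(0)
d, window = 8, 4
q = rng.normal(size=d)
keys = rng.normal(size=(window, d))

# The same window contents, early in the stream and a million tokens later.
early = window_scores(q, 3, keys, np.arange(0, 4))
late = window_scores(q, 1_000_003, keys, np.arange(1_000_000, 1_000_004))

# Scores agree because only the query-key offsets matter.
shift_invariant = bool(np.allclose(early, late, atol=1e-6))
```

Because only offsets matter, a streaming implementation could index the text window with small window-relative positions instead of ever-growing absolute ones, which is what makes an unbounded stream tractable under this reading.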
### Formula representation

Some of the formulas used in the paper are given below.

#### Cross-attention calculation

\[ a_t = \text{Softmax}\left(\frac{\left[\, q_t K_{\text{enr}}^T \;;\; \text{RoPE}(q_t, t)\,\text{RoPE}(K_{\text{txt}}, T_t)^T \,\right]}{\sqrt{d_k}}\right)\begin{bmatrix} V_{\text{enr}} \\ V_{\text{txt}} \end{bmatrix} \]

where $\text{RoPE}(K, T)$ rotates the key matrix $K$ by the position indices $T$, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation along the key axis, so that a single softmax spans both the $\text{enr}$ and $\text{txt}$ keys.

#### Semantic-guided decoding

During inference, the guiding token set $T_{\text{guiding}}$ is determined from the grapheme tokens generated so far and the transcript, and $T_{\text{top-}k}$ is the set of the $k$ most probable tokens. The next grapheme token is sampled from $T_{\text{guiding}} \cup T_{\text{top-}k}$ after the probabilities of the guiding tokens are scaled up and the distribution is renormalized:

\[ \tilde{p}[T_{\text{guiding}}] \leftarrow \tilde{p}[T_{\text{guiding}}] \times (1 + \lambda) \]

\[ g_t \sim \text{TopKSampling}(\tilde{p}, k) \]

With these techniques, LiveSpeech 2 can reliably handle streaming text-to-speech conversion with low latency.
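The semantic-guided decoding step described above can be sketched for a single time step as follows. This is one possible reading rather than the authors' code: it assumes the candidate set is the union of the top-$k$ tokens and the guiding tokens, that all other tokens get zero probability, and that the guiding set is supplied by the caller (how it is derived from the transcript is not covered here). The function name `guided_sample` and the toy probabilities are hypothetical.

```python
import numpy as np

def guided_sample(probs, guiding_ids, k, lam, rng):
    """Sample one token from the union of top-k and guiding tokens.

    probs       : 1-D array of next-token probabilities
    guiding_ids : token ids favoured by the transcript (assumed given)
    k           : number of top-probability tokens kept
    lam         : boost factor lambda; guiding tokens are scaled by (1 + lam)
    """
    probs = np.asarray(probs, dtype=np.float64)
    guiding_ids = np.asarray(guiding_ids, dtype=np.int64)

    top_k_ids = np.argsort(probs)[-k:]                  # k most probable ids
    candidates = np.union1d(top_k_ids, guiding_ids)     # T_guiding U T_top-k

    p_tilde = np.zeros_like(probs)
    p_tilde[candidates] = probs[candidates]             # drop everything else
    p_tilde[guiding_ids] *= 1.0 + lam                   # re-weight guiding tokens
    p_tilde /= p_tilde.sum()                            # renormalize

    token = int(rng.choice(len(probs), p=p_tilde))
    return token, p_tilde

# Toy example: token 4 is unlikely under the model but expected by the transcript.
rng = np.random.default_rng(0)
probs = [0.5, 0.3, 0.1, 0.06, 0.04]
token, p_tilde = guided_sample(probs, guiding_ids=[4], k=2, lam=1.0, rng=rng)
```

In this toy case the guiding token stays in the candidate set even though it falls outside the top 2, and its share of the probability mass grows, which is the mechanism that keeps the generated speech close to the transcript.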