Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis

Théodor Lemerle, Nicolas Obin, Axel Roebel
2024-06-11
Abstract: Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning. Notably, the decoder-only transformer is the prominent architecture in this domain. However, transformers face challenges stemming from their quadratic complexity in sequence length, impeding training on lengthy sequences and resource-constrained hardware. Moreover, they lack specific inductive bias with regard to the monotonic nature of TTS alignments. In response, we propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms for reducing repeating and skipping issues. Consequently, our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size. Our implementation and demos are available at https://github.com/theodorblackbird/lina-speech.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
The paper addresses two weaknesses of current Transformer-based text-to-speech (TTS) models. First, self-attention has quadratic complexity in sequence length, which makes training on long sequences slow and resource-intensive. Second, Transformers lack an inductive bias for the monotonic text-to-audio alignment that TTS requires, which leads to repeated or skipped words in the generated speech. To tackle both, the paper proposes a new architecture called Small-E, which replaces self-attention with a linear-attention mechanism and introduces a Position-Aware Cross-Attention (PACA) mechanism to reduce repetition and skipping. The main contributions are:

1. **Linear Causal Language Model (LCLM) blocks**: Self-attention is replaced with a "time-mixing" mechanism of linear complexity in sequence length, which improves training efficiency, especially on long sequences.
2. **Position-Aware Cross-Attention (PACA)**: A cross-attention mechanism designed specifically for TTS that helps address the repetition and skipping issues of autoregressive models.
3. **Zero-shot voice cloning**: The model can be trained efficiently on resource-constrained hardware and, according to the authors, achieves state-of-the-art zero-shot voice cloning against baselines of comparable size.

With these changes, Small-E improves training efficiency while remaining competitive in the naturalness and speaker similarity of the generated speech.
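
To make the linear-complexity claim concrete, here is a minimal sketch of generic causal linear attention written as a recurrence. It is not the paper's LCLM block: the `elu(x) + 1` feature map, the function names, and the plain-list representation are all assumptions chosen for illustration, and the recurrent architecture used in Small-E may differ. The point is only that a fixed-size running state replaces the growing attention matrix, so the cost per step does not depend on how many steps came before.

```python
import math


def feature_map(x):
    # Assumed positive feature map: elu(x) + 1 (illustrative choice only).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(queries, keys, values):
    """Recurrent form: one pass over the sequence, O(T) in length T."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Fold the current key/value pair into the fixed-size state.
        for i in range(d_k):
            norm[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        # Read out with the current query; only past and present are in state.
        denom = sum(phi_q[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs


def causal_quadratic_reference(queries, keys, values):
    """Same result computed pairwise, O(T^2), for comparison."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(phi_q, feature_map(keys[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total
             for j in range(d_v)]
        )
    return outputs
```

Both functions return the same outputs up to floating-point error; the recurrent version keeps a `d_k × d_v` state instead of revisiting every earlier position, which is what allows training on long samples without quadratic growth in cost.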