Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis

Théodor Lemerle, Nicolas Obin, Axel Roebel

2024-06-11

Abstract: Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning. Notably, the decoder-only transformer is the prominent architecture in this domain. However, transformers face challenges stemming from their quadratic complexity in sequence length, impeding training on lengthy sequences and resource-constrained hardware. Moreover, they lack a specific inductive bias with regard to the monotonic nature of TTS alignments. In response, we propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms for reducing repeating and skipping issues. Consequently, our architecture can be efficiently trained on long samples and achieves state-of-the-art zero-shot voice cloning against baselines of comparable size. Our implementation and demos are available at https://github.com/theodorblackbird/lina-speech.

Audio and Speech Processing, Computation and Language, Sound

What problem does this paper attempt to address?

The paper addresses two weaknesses of current Transformer-based text-to-speech (TTS) models. First, self-attention has quadratic complexity in sequence length, which makes training on long sequences slow and costly, particularly on resource-constrained hardware. Second, Transformers lack an inductive bias for the monotonic alignment between text and speech, which leads to repeated or skipped content at synthesis time. To tackle these challenges, the paper proposes a new architecture called Small-E, which replaces self-attention with a linear attention mechanism and introduces a Position-Aware Cross-Attention (PACA) mechanism to reduce repetition and skipping.

The main contributions of the paper are:

1. **Introduction of Linear Causal Language Model (LCLM) blocks**: The self-attention mechanism is replaced with a "time-mixing" mechanism of linear complexity, which improves training efficiency, especially for long sequences.
2. **Position-Aware Cross-Attention (PACA) mechanism**: A cross-attention mechanism designed specifically for TTS, which helps to address repetition and skipping issues in autoregressive models.
3. **Zero-shot voice cloning**: The model can be trained efficiently on resource-constrained hardware and reaches state-of-the-art zero-shot voice cloning against baselines of comparable size.

With these changes, Small-E can be trained efficiently on long samples while producing natural speech that stays close to the target speaker's voice.
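To make the complexity argument concrete, the sketch below contrasts a generic kernelized causal linear attention, computed as a recurrence over a fixed-size state, with the same attention computed by revisiting every earlier position. It is only an illustration under stated assumptions: the `elu(x) + 1` feature map and all function names are hypothetical choices borrowed from the general linear-attention literature, and the summary above does not specify the exact time-mixing layer used in Small-E, which may differ.

```python
import math


def feature_map(x):
    # Assumed positive feature map elu(x) + 1 (illustrative choice, not taken from the paper).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(queries, keys, values):
    """Recurrent form: one fixed-size state update per step, so cost grows linearly in length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fq, fk = feature_map(q), feature_map(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        denom = sum(fq[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(fq[i] * state[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs


def causal_quadratic_attention(queries, keys, values):
    """Same kernel, but every step rescans all earlier positions: quadratic in length."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        fq = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(fq, feature_map(keys[s]))) for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total for j in range(d_v)]
        )
    return outputs


if __name__ == "__main__":
    # Toy inputs: 3 time steps, key/query dimension 2, value dimension 2.
    q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
    k = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
    v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(causal_linear_attention(q, k, v))
    print(causal_quadratic_attention(q, k, v))
```

Both functions compute the same outputs up to floating-point error. The recurrent version never stores or rescans past positions, which is what makes training on long sequences cheaper in this family of models.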
