Abstract: In this paper, we present a novel architecture to realize fine-grained style control on the transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our designed cross-attention blocks for fusion and alignment between content and style. As the fusion is performed along with the skip connection, our cross-attention block provides a good inductive bias to gradually infuse the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating LST during training and using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.

What problem does this paper attempt to address?

The problem this paper addresses is how to achieve fine-grained style control in a Transformer-based text-to-speech synthesis system (TransformerTTS). Specifically, the authors introduce a new architecture so that the generated speech is natural and intelligible, conveys the speaking style of the reference speech more faithfully, and avoids the content-leakage problem.

### Main Problem Analysis

1. **Content-leakage Problem**:
   - In existing TTS systems, the model may pick up linguistic content from the reference speech, so that the generated speech no longer matches the input text. This phenomenon is called "content leakage". To overcome it, the paper proposes a way to model and control the speaking style without encoding linguistic content.
2. **Fine-grained Style Control**:
   - Existing style control methods usually handle only global styles or fixed-length style embeddings, and cannot capture or control characteristics that vary over time within a sentence (such as speech rate, volume and prosody). The paper proposes a method based on Local Style Tokens (LST) to achieve finer style control.
3. **Style Transfer in Multi-speaker Settings**:
   - Beyond the single-speaker scenario, the paper also explores style transfer in multi-speaker settings, so that styles can be transferred between different speakers.

### Solutions

- **Local Style Tokens (LST)**:
  - Extract a time sequence of local style tokens (LST) from the reference speech, and fuse this style information with the text content through a cross-attention mechanism. Because the LST form a sequence rather than a single vector, the model can capture style changes within an utterance.
- **Cross-attention Module**:
  - Replace the original content encoder with cross-attention blocks that fuse and align content and style. Because the fusion is performed along with the skip connection, each block adds style information on top of the phoneme representation, which gives a good inductive bias for infusing the style gradually.
- **Training Tricks**:
  - During training, randomly truncate the LST so that the model cannot rely on the complete reference speech, which alleviates content leakage. In addition, use wav2vec 2.0 features as the audio representation of the reference speech, which further reduces the risk of content leakage.

With these changes, the proposed system performs better in terms of naturalness, intelligibility and style transferability, and mitigates the content-leakage problem.
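The cross-attention fusion and the random LST truncation described above can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a single attention head, no layer normalization or feed-forward sublayer, and prefix-style truncation (the summary does not say which part of the LST is dropped). The names `random_truncate` and `cross_attention_block`, the weight matrices, and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def random_truncate(lst, rng, min_len=1):
    # Training-time trick (assumed form): keep a random-length prefix of the
    # local style token sequence so the model never relies on the full reference.
    keep = int(rng.integers(min_len, lst.shape[0] + 1))  # upper bound is exclusive
    return lst[:keep]

def cross_attention_block(phonemes, lst, w_q, w_k, w_v):
    # Queries come from the phoneme (content) sequence,
    # keys and values come from the style token sequence.
    q = phonemes @ w_q                        # (T_text, d)
    k = lst @ w_k                             # (T_style, d)
    v = lst @ w_v                             # (T_style, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_text, T_style)
    attn = softmax(scores, axis=-1)           # soft alignment of content to style
    # Skip connection: style is added on top of the content representation.
    return phonemes + attn @ v, attn

# Toy example with made-up sizes: 5 phonemes, 12 style tokens, width 8.
rng = np.random.default_rng(0)
d = 8
phonemes = rng.normal(size=(5, d))
lst = rng.normal(size=(12, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

style = random_truncate(lst, rng)
fused, attn = cross_attention_block(phonemes, style, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

The output keeps the length of the phoneme sequence regardless of how many style tokens survive truncation, so several such blocks can be stacked, each one adding a little more style information through its skip connection.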

Fine-grained style control in Transformer-based Text-to-speech Synthesis

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Towards Multi-Scale Style Control for Expressive Speech Synthesis

Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis.

Controllable Accented Text-to-Speech Synthesis with Fine and Coarse-Grained Intensity Rendering

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

Style Mixture of Experts for Expressive Text-To-Speech Synthesis

Exploring synthetic data for cross-speaker style transfer in style representation based TTS

Interactive Text-to-Speech via Semi-supervised Style Transfer Learning