StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Zhiyong Chen, Xinnuo Li, Zhiqi Ai, Shugong Xu

2024-09-24

Abstract: We introduce StyleFusion-TTS, a prompt and/or audio referenced, style and speaker-controllable, zero-shot text-to-speech (TTS) synthesis system designed to enhance the editability and naturalness of current research literature. We propose a general front-end encoder as a compact and effective module to utilize multimodal inputs including text prompts, audio references, and speaker timbre references in a fully zero-shot manner and produce disentangled style and speaker control embeddings. Our novel approach also leverages a hierarchical conformer structure for the fusion of style and speaker control embeddings, aiming to achieve optimal feature fusion within the current advanced TTS architecture. StyleFusion-TTS is evaluated through multiple metrics, both subjectively and objectively. The system shows promising performance across our evaluations, suggesting its potential to contribute to the advancement of the field of zero-shot text-to-speech synthesis.

Audio and Speech Processing, Sound

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper introduces a new system called **StyleFusion-TTS**, which aims to address several key challenges in zero-shot text-to-speech (ZS-TTS). Specifically, the system addresses the following major issues:

1. **Accurate Imitation of Pronunciation Features**: In zero-shot scenarios, the system must accurately replicate the timbre of the reference speaker while still allowing users to customize voice style (such as emotion or accent).
2. **Style Control**: The system should offer fine control over voice style, including emotion, speed, and volume, while maintaining high editability.
3. **Multimodal Input Fusion**: The system uses three input modes (text prompts from natural conversations, style reference audio, and speaker reference audio) to achieve precise control over style and speaker identity.

### Main Contributions

1. **General Front-end Encoder**: A compact front-end encoder, the General Style Fusion Encoder (GSF-enc), encodes and disentangles the control embeddings for speaker identity and emotional style, improving the separation of speaker and style modeling.
2. **Hierarchical Fusion Module**: A Hierarchical Conformer Two-Branch Style Control Module (HC-TSCM) is introduced to ensure effective feature fusion in zero-shot TTS.
3. **Enhanced TTS Architecture**: The StyleFusion-TTS system builds on existing TTS architectures to generate controllable and natural speech.

Across a range of subjective and objective evaluation metrics, the system shows promising performance, suggesting potential for further development in the field of zero-shot text-to-speech synthesis.
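To make the idea of fusing separate style and speaker control embeddings into a TTS backbone more concrete, the following is a toy sketch only. It is not the paper's HC-TSCM, which is built from conformer blocks whose details are not given here. Instead it assumes a simple scale-and-shift (FiLM-style) conditioning applied in two successive branches, one per control embedding. All function names, dimensions, and weights (`init_branch`, `apply_branch`, `fuse`, the random projections) are hypothetical and chosen purely for illustration.

```python
import numpy as np

def init_branch(rng, cond_dim, hidden_dim):
    """Random projection weights for one conditioning branch (hypothetical)."""
    return {
        "scale": rng.normal(0.0, 0.02, size=(cond_dim, hidden_dim)),
        "shift": rng.normal(0.0, 0.02, size=(cond_dim, hidden_dim)),
    }

def apply_branch(hidden, cond, branch):
    """Scale-and-shift the hidden sequence using one control embedding.

    hidden: (T, hidden_dim) text-encoder states
    cond:   (cond_dim,) utterance-level control embedding
    """
    scale = cond @ branch["scale"]  # (hidden_dim,)
    shift = cond @ branch["shift"]  # (hidden_dim,)
    return hidden * (1.0 + scale) + shift  # broadcasts over the T axis

def fuse(hidden, style_emb, speaker_emb, style_branch, speaker_branch):
    """Apply the style branch, then the speaker branch, plus a residual path."""
    h = apply_branch(hidden, style_emb, style_branch)
    h = apply_branch(h, speaker_emb, speaker_branch)
    return hidden + h

# Assumed toy sizes; real systems use far larger dimensions.
rng = np.random.default_rng(0)
T, hidden_dim, style_dim, speaker_dim = 5, 8, 4, 6
hidden = rng.normal(size=(T, hidden_dim))
style_emb = rng.normal(size=style_dim)      # stands in for a style embedding
speaker_emb = rng.normal(size=speaker_dim)  # stands in for a speaker embedding
style_branch = init_branch(rng, style_dim, hidden_dim)
speaker_branch = init_branch(rng, speaker_dim, hidden_dim)

fused = fuse(hidden, style_emb, speaker_emb, style_branch, speaker_branch)
print(fused.shape)  # (5, 8)
```

Keeping the two embeddings in separate branches means each one can be swapped independently, for example changing the style embedding while holding the speaker embedding fixed, which is the kind of editability the summary describes.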

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis

PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Fine-grained style control in Transformer-based Text-to-speech Synthesis

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control

CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis