StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Zhiyong Chen, Xinnuo Li, Zhiqi Ai, Shugong Xu
2024-09-24
Abstract: We introduce StyleFusion-TTS, a prompt and/or audio referenced, style and speaker-controllable, zero-shot text-to-speech (TTS) synthesis system designed to enhance the editability and naturalness of current research literature. We propose a general front-end encoder as a compact and effective module to utilize multimodal inputs including text prompts, audio references, and speaker timbre references in a fully zero-shot manner and produce disentangled style and speaker control embeddings. Our novel approach also leverages a hierarchical conformer structure for the fusion of style and speaker control embeddings, aiming to achieve optimal feature fusion within the current advanced TTS architecture. StyleFusion-TTS is evaluated through multiple metrics, both subjectively and objectively. The system shows promising performance across our evaluations, suggesting its potential to contribute to the advancement of the field of zero-shot text-to-speech synthesis.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper introduces a new system called **StyleFusion-TTS**, which aims to address several key challenges in Zero-shot Text-to-Speech (ZS-TTS). Specifically, the system addresses the following major issues:

1. **Accurate Imitation of Pronunciation Features**: In zero-shot scenarios, accurately replicating the timbre of the reference speaker and allowing users to customize voice styles (such as emotion, accent, etc.).
2. **Style Control**: Achieving high control over voice styles, including emotion, speed, volume, etc., while maintaining high editability.
3. **Multimodal Input Fusion**: Utilizing three input modes (text prompts from natural conversations, style reference audio, and speaker reference audio) to achieve precise control over style and speaker identity.

### Main Contributions

1. **General Front-end Encoder**: A compact front-end encoder (General Style Fusion Encoder, GSF-enc) is proposed to encode and decouple control embeddings of speaker identity and emotional style, improving the decoupling capability of speaker and style modeling.
2. **Hierarchical Fusion Module**: A hierarchical conformer two-branch style control module (Hierarchical Conformer Two-Branch Style Control Module, HC-TSCM) is introduced to ensure effective feature fusion in zero-shot TTS.
3. **Enhanced TTS Architecture**: The StyleFusion-TTS system is proposed, improving existing TTS architectures to generate controllable and natural speech.

Through various subjective and objective evaluation metrics, the system demonstrates good performance and has the potential for further development in the field of zero-shot text-to-speech synthesis.
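
To make the data flow described above more concrete, the following is a minimal toy sketch in Python. It is not the paper's implementation: the names `ControlInputs`, `toy_embed`, `encode_controls`, and `fuse` are hypothetical, the hash-based "encoder" merely stands in for learned networks, and treating the speaker reference as required while the two style inputs are optional is an assumption made for illustration. It only shows the interface idea: up to three inputs go in, two separate control embeddings (style and speaker) come out, and both then condition the text hidden states.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple

DIM = 4  # toy embedding size, chosen only for illustration


def toy_embed(tag: str, payload: bytes) -> List[float]:
    """Stand-in for a learned encoder: hash the input into DIM floats in [0, 1]."""
    digest = hashlib.sha256(tag.encode() + payload).digest()
    return [b / 255.0 for b in digest[:DIM]]


@dataclass
class ControlInputs:
    """Hypothetical container for the three input modes."""
    speaker_audio: bytes                 # speaker timbre reference (assumed required)
    style_prompt: Optional[str] = None   # text description of the desired style
    style_audio: Optional[bytes] = None  # style reference audio


def encode_controls(inputs: ControlInputs) -> Tuple[List[float], List[float]]:
    """Return separate (style, speaker) embeddings from whichever inputs are present."""
    # The speaker embedding depends only on the speaker reference.
    speaker = toy_embed("speaker", inputs.speaker_audio)

    # The style embedding is built from the text prompt and/or the style audio.
    style_parts = []
    if inputs.style_prompt is not None:
        style_parts.append(toy_embed("style-text", inputs.style_prompt.encode()))
    if inputs.style_audio is not None:
        style_parts.append(toy_embed("style-audio", inputs.style_audio))

    if not style_parts:
        style = [0.0] * DIM  # no style control given: neutral vector
    else:
        # Average the available style sources element-wise.
        style = [sum(vals) / len(style_parts) for vals in zip(*style_parts)]
    return style, speaker


def fuse(hidden: List[List[float]], style: List[float], speaker: List[float]) -> List[List[float]]:
    """Toy fusion: add both control embeddings to every frame of the hidden states."""
    return [[h + s + p for h, s, p in zip(frame, style, speaker)] for frame in hidden]


if __name__ == "__main__":
    controls = ControlInputs(speaker_audio=b"speaker-ref", style_prompt="sad, slow")
    style_vec, speaker_vec = encode_controls(controls)
    frames = [[0.0] * DIM for _ in range(3)]  # placeholder text hidden states
    print(fuse(frames, style_vec, speaker_vec))
```

In this sketch the speaker embedding never sees the style inputs, which mirrors the disentanglement goal only at the interface level. The element-wise addition in `fuse` is a deliberately simple placeholder for the conformer-based fusion the paper proposes.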