Abstract: Style transfer in Text-to-Speech (TTS) is the task of imposing style information on text content so that the synthesized speech carries a specific style. However, most existing style transfer approaches rely on either fixed emotional labels or reference speech clips, which limits how flexibly the style can be controlled. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal, style-controllable TTS framework named MM-TTS. It can use a prompt of any modality in a unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech in a single system. The challenges of modeling such a multi-modal style-controllable TTS lie mainly in two aspects: 1) aligning the multi-modal information into a unified style space, so that a prompt of arbitrary modality can serve as the style prompt in a single system, and 2) efficiently transferring the unified style representation to the given text content, so that the generated speech reflects the style of the prompt. To address these problems, we propose an aligned multi-modal prompt encoder that embeds different modalities into a unified style space, supporting style transfer from different modalities. Additionally, we present a new adaptive style transfer method named Style Adaptive Convolutions to achieve a better style representation. Furthermore, we design a Rectified Flow based Refiner to address the over-smoothing of the Mel-spectrogram and generate audio of higher fidelity. Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking heads. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multi-modal prompts.
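
As a rough illustration of the idea behind a rectified-flow refiner, the sketch below shows the generic rectified flow recipe: points are interpolated along a straight line between a start sample and a target, the regression target for the velocity field is the constant difference between them, and sampling integrates that velocity with a few Euler steps. This is not the MM-TTS implementation. The function names, the array shapes, the choice to start from a coarse Mel-spectrogram, and the oracle velocity that stands in for a trained network are all assumptions made for the example.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (start) to x1 (target) at time t in [0, 1]."""
    return (1.0 - t) * x0 + t * x1


def velocity_target(x0, x1):
    """Regression target for the velocity field; constant along the straight path."""
    return x1 - x0


def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
# Hypothetical data: 80 Mel bins by 16 frames (shapes chosen for illustration only).
coarse_mel = rng.normal(size=(80, 16))  # assumed over-smoothed starting point
target_mel = rng.normal(size=(80, 16))  # assumed sharper target


def oracle_velocity(x, t):
    """Stand-in for a trained network: the ideal velocity for this single pair."""
    return velocity_target(coarse_mel, target_mel)


refined = euler_sample(oracle_velocity, coarse_mel, num_steps=4)
```

With the oracle velocity the path is exactly straight, so even a handful of Euler steps lands on the target; a learned velocity field would only approximate this, which is why the step count is left as a parameter.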

Style Mixture of Experts for Expressive Text-To-Speech Synthesis

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Memory-enhanced text style transfer with dynamic style learning and calibration

Interactive Text-to-Speech via Semi-supervised Style Transfer Learning

Disentangling Style and Speaker Attributes for TTS Style Transfer

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis

M3TTS: Multi-modal Text-to-speech of Multi-Scale Style Control for Dubbing

Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach

Exploring synthetic data for cross-speaker style transfer in style representation based TTS

Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Towards Multi-Scale Style Control for Expressive Speech Synthesis

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis