Abstract:The style transfer task in Text-to-Speech refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are either based on fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal and style controllable TTS framework named MM-TTS. It can utilize any modality as the prompt in unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech in a system. The challenges of modeling such a multi-modal style controllable TTS mainly lie in two aspects:1)aligning the multi-modal information into a unified style space to enable the input of arbitrary modality as the style prompt in a single system, and 2)efficiently transferring the unified style representation into the given text content, thereby empowering the ability to generate prompt style-related voice. To address these problems, we propose an aligned multi-modal prompt encoder that embeds different modalities into a unified style space, supporting style transfer for different modalities. Additionally, we present a new adaptive style transfer method named Style Adaptive Convolutions to achieve a better style representation. Furthermore, we design a Rectified Flow based Refiner to solve the problem of over-smoothing Mel-spectrogram and generate audio of higher fidelity. Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking head. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multi-modal prompts.

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Toward Synthesizing Expressive Mandarin Speech

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

Adaptive Text to Speech for Spontaneous Style

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

AS-Speech: Adaptive Style for Speech Synthesis

MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis

Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

Fine-grained style control in Transformer-based Text-to-speech Synthesis

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

EE-TTS: Emphatic Expressive TTS with Linguistic Information