Abstract: While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine control over the emotion rendering of the output speech remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, the diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes a diffusion model to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.
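
To make the idea of compositional textual conditioning concrete, here is a minimal sketch of how a style description could be assembled from attribute levels before being passed to a text encoder. The template wording, the three-level scale, and the function name `describe_style` are assumptions made for illustration; the abstract does not specify the prompt format that ParaEVITS or ParaCLAP actually uses.

```python
# Hypothetical prompt builder: none of these names or templates come from the paper.
ATTRIBUTES = ("pitch", "jitter", "loudness")  # attributes named in the abstract
LEVELS = ("low", "normal", "high")            # assumed discrete levels


def describe_style(emotion=None, **attribute_levels):
    """Compose a textual style description from an optional emotion and attribute levels."""
    unknown = set(attribute_levels) - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")

    parts = []
    if emotion:
        parts.append(f"speaker is {emotion}")
    # Iterate in a fixed order so the same settings always yield the same prompt.
    for name in ATTRIBUTES:
        if name in attribute_levels:
            level = attribute_levels[name]
            if level not in LEVELS:
                raise ValueError(f"{name} must be one of {LEVELS}, got {level!r}")
            parts.append(f"{name} is {level}")
    return ", ".join(parts)


prompt = describe_style(emotion="angry", loudness="high", pitch="high")
print(prompt)
```

Because each clause is independent, changing a single attribute changes only one part of the description, which is the kind of compositionality the abstract refers to.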

What problem does this paper attempt to address?

The main problem this paper addresses is the lack of fine-grained control over emotional rendering in current emotional text-to-speech (TTS) systems. Specifically:

1. **Limitations of Existing Systems**:
   - Although existing emotional TTS systems can generate highly intelligible emotional speech, they still struggle to control the emotional expression of the output finely.
   - Traditional methods rely on predefined emotion labels or reference speech. The former leads to stereotyped emotional patterns, and the latter makes it difficult to select an appropriate reference utterance, which limits flexibility in practice.
2. **Research Objectives**:
   - Propose a new framework, ParaEVITS, which uses natural language guidance to enhance control over emotional rendering.
   - Use contrastive learning and diffusion models to generate emotional embeddings from text descriptions, allowing more flexible control over the emotional attributes of synthesized speech.
3. **Innovations**:
   - **ParaCLAP-NP**: Draws on computational paralinguistics (CP) and uses contrastive learning to associate low-level acoustic features with high-level emotional descriptions, producing more detailed emotional representations.
   - **Diffusion Model**: Generates emotional embeddings from natural-language descriptions, which guide the TTS framework to produce emotional speech that matches the text prompt.
   - **Multimodal Fusion**: Exploits the synergy between the audio encoder and the text encoder to map textual descriptions to emotional speech.
4. **Experimental Verification**:
   - System performance is evaluated through subjective and objective experiments. The results show that ParaEVITS can effectively control emotional rendering while maintaining speech quality, with performance that varies across emotion categories.

In summary, this paper aims to address the shortcomings of existing emotional TTS systems in controlling emotional rendering by introducing the ParaEVITS framework, providing a more flexible and controllable method for generating emotional speech.
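
The contrastive learning step mentioned above can be illustrated with the symmetric cross-entropy objective commonly used for CLIP/CLAP-style pretraining. This is a generic sketch in pure Python and not the paper's implementation: the function names, the toy embeddings, and the temperature value of 0.07 are assumptions, and the summary does not state which exact loss ParaCLAP uses.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio losses over paired embeddings."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every audio and every text embedding, scaled by temperature.
    logits = [[sum(x * y for x, y in zip(av, tv)) / temperature for tv in t] for av in a]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the text-to-audio direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy data (assumed): three audio clips and their matching descriptions.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
text_shuffled = text_aligned[1:] + text_aligned[:1]  # deliberately mismatched pairs

aligned_loss = symmetric_contrastive_loss(audio, text_aligned)
shuffled_loss = symmetric_contrastive_loss(audio, text_shuffled)
print(aligned_loss < shuffled_loss)
```

The loss is low when each audio embedding is closest to its own description and high when the pairs are mismatched, which is what pulls acoustic features and emotional descriptions into a shared space.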

Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models

Hierarchical Control of Emotion Rendering in Speech Synthesis

EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities

EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin

Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions

Toward Any-to-Any Emotion Voice Conversion using Disentangled Diffusion Framework

Emotional Audio-Visual Speech Synthesis Based on PAD

Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech

EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Emo-Tts: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability

Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion