Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models

Xin Jing, Kun Zhou, Andreas Triantafyllopoulos, Björn W. Schuller
2024-09-10
Abstract: While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine control over the emotion rendering of the output speech remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, the diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes a diffusion model to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the lack of fine-grained control over emotional rendering in current emotional text-to-speech (TTS) systems, even when those systems produce high-quality emotional speech. Specifically:

1. **Limitations of Existing Systems**:
   - Existing emotional TTS systems can generate highly intelligible emotional speech, but they still struggle to finely control the emotional expression of the output.
   - Traditional methods rely on predefined emotion labels or on reference speech. Labels lead to stereotyped emotional patterns, and suitable reference speech is hard to select, which limits flexibility in practice.
2. **Research Objectives**:
   - Propose a new framework, ParaEVITS, that uses natural language guidance to enhance control over emotional rendering.
   - Use contrastive learning and diffusion models to generate emotional embeddings from text descriptions, so that the emotional attributes of the synthesized speech can be controlled more flexibly.
3. **Innovations**:
   - **ParaCLAP-NP**: Draws on computational paralinguistics (CP) and uses contrastive learning to associate low-level acoustic features with high-level emotional descriptions, yielding more detailed emotional representations.
   - **Diffusion Model**: Generates emotional embeddings from natural language descriptions, which guide the TTS framework to produce emotional speech that matches the text prompt.
   - **Multimodal Fusion**: Exploits the synergy between the audio encoder and the text encoder to convert text descriptions into emotional speech efficiently.
4. **Experimental Verification**:
   - System performance is evaluated through subjective and objective experiments. The results show that ParaEVITS can effectively control emotional rendering while maintaining speech quality, with performance that varies across emotion categories.
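The contrastive step that ties audio to emotional descriptions can be illustrated with a small sketch. The code below is not the authors' implementation. It is a generic CLAP-style symmetric contrastive loss written in NumPy, assuming that row `i` of the audio batch is paired with row `i` of the text batch. The function names, the toy embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic CLAP-style loss (hypothetical helper, not from the paper).

    Row i of audio_emb and row i of text_emb are treated as a matched pair;
    every other row in the batch serves as a negative.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature              # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Audio-to-text: each audio clip should pick its own description.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each description should pick its own audio clip.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)


# Toy data standing in for encoder outputs (random, for illustration only).
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
matched_text = audio + 0.01 * rng.normal(size=(4, 8))   # nearly aligned pairs
shuffled_text = matched_text[::-1]                       # pairs deliberately broken

loss_matched = symmetric_contrastive_loss(audio, matched_text)
loss_shuffled = symmetric_contrastive_loss(audio, shuffled_text)
print(f"matched: {loss_matched:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is lower when each audio embedding sits close to its own description than when the pairing is shuffled, which is the property a contrastive language-audio model is trained to exploit.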
In summary, the paper addresses the shortcomings of existing emotional TTS systems in controlling emotional rendering by introducing the ParaEVITS framework, which offers a more flexible and controllable way to generate emotional speech.