Abstract: Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, relying primarily on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on the ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
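For readers unfamiliar with the objective metrics named in the abstract, the following is a minimal sketch of how WER and CER are conventionally computed: the edit distance between a reference transcript and a hypothesis transcript, divided by the reference length in words or characters. The function names (`edit_distance`, `wer`, `cer`) and the example strings are illustrative assumptions, not taken from the MM-TTS code. The abstract does not state how the hypothesis transcripts are obtained; in TTS evaluation they typically come from running a speech recognizer on the synthesized audio, and that step is assumed here and left out of the sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate, ignoring spaces (an assumed convention)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(f"WER: {wer('the cat sat on the mat', 'the cat sat on mat'):.2%}")
    print(f"CER: {cer('hello world', 'hallo world'):.2%}")
```

Both metrics are error rates, so lower is better. Text normalization (casing, punctuation, whether spaces count toward CER) varies between toolkits and can change the reported numbers, so the conventions above should be treated as one possible choice.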

Detection and Emphatic Realization of Contrastive Word Pairs for Expressive Text-to-speech Synthesis

Automatic detection of contrastive word pairs using textual and acoustic features

Synthesizing English Emphatic Speech for Multimodal Corrective Feedback in Computer-Aided Pronunciation Training

Generating emphatic speech with hidden Markov model for expressive speech synthesis

Emphasis Detection for Voice Dialogue Applications Using Multi-channel Convolutional Bidirectional Long Short-Term Memory Network

EE-TTS: Emphatic Expressive TTS with Linguistic Information

HMM-based Emphatic Speech Synthesis for Corrective Feedback in Computer-Aided Pronunciation Training

Learning Contextual Representation with Convolution Bank and Multi-head Self-attention for Speech Emphasis Detection

Generating Emphasis from Neutral Speech Using Hierarchical Perturbation Model by Decision Tree and Support Vector Machine

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling

Hierarchical English Emphatic Speech Synthesis Based on HMM with Limited Training Data

Controllable Emphatic Speech Synthesis Based on Forward Attention for Expressive Speech Synthesis

CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis

Inferring Emphasis for Real Voice Data: an Attentive Multimodal Neural Network Approach

Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis

Emphatic Speech Generation with Conditioned Input Layer and Bidirectional LSTMs for Expressive Speech Synthesis

Objective Evaluation Methods for Chinese Text-To-Speech Systems

Modelling the Global Acoustic Correlates of Expressivity for Chinese Text-to-Speech Synthesis