Abstract: Expressive text-to-speech (E-TTS) synthesis is important for enhancing the user experience when communicating with machines through speech. One of the challenges in E-TTS, however, is the lack of a precise description of emotions. Categorical specifications may be insufficient for describing complex emotions, while dimensional specifications suffer from ambiguity in annotation. This work advocates a new approach: describing emotive speech acoustics using spoken exemplars. We investigate methods to extract emotion descriptions from an input exemplar of emotive speech. The measures are combined to form two descriptors, one based on a capsule network (CapNet) and the other on a residual error network (RENet). The first is designed to consider the spatial information in the input exemplar's spectrogram, and the second to capture the contrastive information between emotive acoustic expressions. Two different approaches are applied to convert the variable-length feature sequence into a fixed-size description vector: (1) dynamic routing groups similar capsules into the output description; and (2) the hidden states of a recurrent neural network store the temporal information for the description. The two descriptors are integrated into a state-of-the-art sequence-to-sequence architecture to obtain an end-to-end architecture that is optimized as a whole towards the single goal of generating correct emotive speech. Experimental results on a public audiobook dataset demonstrate that the two exemplar-based approaches achieve significant improvements over the baseline system in both emotion similarity and speech quality.
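The two conversions from a variable-length feature sequence to a fixed-size description vector can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, tensor shapes, number of routing iterations, and random weights below are all assumptions made for illustration, and the CapNet and RENet feature extractors that would produce the inputs are omitted. The routing loop follows the generic routing-by-agreement scheme for capsule networks, and the recurrent part is a plain tanh RNN.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Capsule non-linearity: keeps the direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    """Route a variable number of input capsules to a fixed set of output capsules.

    u_hat: (n_in, n_out, dim) prediction vectors; n_in grows with input length
    (assumed shape). Returns a flat vector of size n_out * dim.
    """
    n_in, n_out, _ = u_hat.shape
    logits = np.zeros((n_in, n_out))
    for _ in range(n_iters):
        # Coupling coefficients: softmax over the output capsules.
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        coupling = e / e.sum(axis=1, keepdims=True)
        # Weighted sum of predictions per output capsule, then squash.
        v = squash(np.einsum("ij,ijd->jd", coupling, u_hat))
        # Raise the logits of input capsules that agree with the output.
        logits = logits + np.einsum("ijd,jd->ij", u_hat, v)
    return v.reshape(-1)

def rnn_last_state(frames, w_x, w_h, b):
    """Run a plain tanh RNN over (n_frames, feat_dim) and keep the final hidden state."""
    h = np.zeros(w_h.shape[0])
    for x in frames:
        h = np.tanh(w_x @ x + w_h @ h + b)
    return h

# Hypothetical usage: two exemplars of different length give same-size descriptions.
rng = np.random.default_rng(0)
short_caps = rng.normal(size=(50, 4, 8))    # 50 input capsules
long_caps = rng.normal(size=(120, 4, 8))    # 120 input capsules
desc_short = dynamic_routing(short_caps)    # shape (32,)
desc_long = dynamic_routing(long_caps)      # shape (32,)

w_x = rng.normal(scale=0.1, size=(16, 80))  # assumed 80-dim frames, 16-dim state
w_h = rng.normal(scale=0.1, size=(16, 16))
b = np.zeros(16)
desc_rnn = rnn_last_state(rng.normal(size=(200, 80)), w_x, w_h, b)  # shape (16,)
```

In both cases the size of the output depends only on the model configuration (number and dimension of output capsules, or hidden-state size), not on the length of the exemplar, which is what allows the description to condition a sequence-to-sequence synthesizer.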

Exemplar-Based Emotive Speech Synthesis