Abstract: Expressive speech synthesis has received increased attention in recent years. Stress (or pitch accent) is the perceptual prominence within words or utterances, and it contributes to the expressivity of speech. This paper summarizes our contribution to Mandarin expressive speech synthesis. A novel hierarchical stress modeling and generation method for Mandarin is proposed and integrated into both HMM-based speech synthesis (HTS) and Fujisaki model-based speech synthesis systems to accurately model the undulation of the pitch contour. In HMM-based expressive speech synthesis, stress-related contextual features obtained from the hierarchical model are used, in addition to the traditional prosodic features of HTS, to model the prosodic variation caused by stress. A rule-based and a Deep Belief Network-based prosodic variation model are proposed and then used in the stress adaptation module of HTS. The other approach uses the Fujisaki model to improve the expressiveness of synthetic speech. The hierarchical stress model is introduced into the phrase and tone command control mechanisms of the Fujisaki model, and the pitch contour is then generated directly by superposing these two levels of commands. Experimental results showed that the proposed hierarchical stress modeling and generation methods successfully capture the macro- and micro-characteristics of stress. The proposed methodology applies to a range of areas, such as conveying attitude and indicating focus in spoken dialog systems. (C) 2015 Elsevier B.V. All rights reserved.
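
The superposition of phrase and tone commands mentioned in the abstract can be illustrated with the standard Fujisaki formulation, in which the log-F0 contour is a baseline plus the summed responses of the two command levels. The following is a minimal sketch of that general model only, not the paper's system: the constants `alpha`, `beta`, and `gamma` are commonly cited default values, all command timings and amplitudes are hypothetical, and modeling stress as a larger tone command amplitude is an assumption made purely for illustration.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def tone_response(t, beta=20.0, gamma=0.9):
    """Step response of the tone (accent) control mechanism, Ga(t)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(times, fb, phrase_cmds, tone_cmds):
    """Return F0 in Hz at each time point.

    phrase_cmds: list of (onset, amplitude) impulses.
    tone_cmds:   list of (onset, offset, amplitude) pulses; amplitudes
                 may be negative, as is usual for Mandarin tones.
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)  # baseline frequency in the log domain
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0)
        for t1, t2, aa in tone_cmds:
            log_f0 += aa * (tone_response(t - t1) - tone_response(t - t2))
        contour.append(math.exp(log_f0))
    return contour


# Hypothetical commands: one phrase command and one tone command.
times = [i * 0.01 for i in range(200)]  # 2 seconds at 10 ms steps
neutral = fujisaki_f0(times, 120.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])
# Assumption for illustration: stress enlarges the tone command amplitude.
stressed = fujisaki_f0(times, 120.0, [(0.0, 0.5)], [(0.3, 0.6, 0.6)])
```

With these values the two contours coincide before the tone command onset at 0.3 s and diverge afterwards, with the stressed contour reaching a higher peak.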

Hierarchical English Emphatic Speech Synthesis Based on HMM with Limited Training Data.

Generating emphatic speech with hidden Markov model for expressive speech synthesis

Synthesizing English Emphatic Speech for Multimodal Corrective Feedback in Computer-Aided Pronunciation Training.

HMM-based Emphatic Speech Synthesis for Corrective Feedback in Computer-Aided Pronunciation Training

Generating Emphasis from Neutral Speech Using Hierarchical Perturbation Model by Decision Tree and Support Vector Machine

Controllable Emphatic Speech Synthesis Based on Forward Attention for Expressive Speech Synthesis

Emphatic Speech Generation with Conditioned Input Layer and Bidirectional LSTMs for Expressive Speech Synthesis.

EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System

Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis

EE-TTS: Emphatic Expressive TTS with Linguistic Information

Learning Cross-Lingual Knowledge with Multilingual BLSTM for Emphasis Detection with Limited Training Data

Hierarchical Stress Modeling and Generation in Mandarin for Expressive Text-to-Speech

A Hierarchical F0 Modeling Method for HMM-based Speech Synthesis

Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis.

Detection and Emphatic Realization of Contrastive Word Pairs for Expressive Text-to-speech Synthesis

Inferring Emphasis for Real Voice Data: an Attentive Multimodal Neural Network Approach.

Emphasis Detection for Voice Dialogue Applications Using Multi-channel Convolutional Bidirectional Long Short-Term Memory Network

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

Modeling Pitch Trajectory by Hierarchical HMM with Minimum Generation Error Training.

HMM-based Speech Synthesis with a Flexible Mandarin Stress Adaptation Model