Abstract: Automatic dubbing, which generates a corresponding version of the input speech in another language, could be widely used in real-world scenarios such as video and game localization. In addition to synthesizing the translated scripts, automatic dubbing needs to transfer the speaking style of the original language to the dubbed speech, giving audiences the impression that the characters are speaking in their native tongue. However, state-of-the-art automatic dubbing systems only model the transfer of duration and speaking rate, neglecting other aspects of speaking style such as emotion, intonation and emphasis, which are also crucial for fully portraying the characters and for speech understanding. In this paper, we propose a joint multi-scale cross-lingual speaking style transfer framework that simultaneously models bidirectional speaking style transfer between languages at both global (i.e., utterance-level) and local (i.e., word-level) scales. The global and local speaking styles in each language are extracted and used to predict the global and local speaking styles in the other language, with an encoder-decoder framework for each direction and a shared bidirectional attention mechanism for both directions. A multi-scale speaking-style-enhanced FastSpeech 2 is then used to synthesize speech from the predicted global and local speaking styles for each language. Experimental results demonstrate the effectiveness of the proposed framework, which outperforms a baseline with only duration transfer in both objective and subjective evaluations.
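
The abstract does not specify how the shared bidirectional attention is computed. As a rough illustration only, the sketch below assumes a single scaled dot-product affinity matrix between the word-level style vectors of two languages, normalized along each axis to give attention in both directions. The function names, the scaling, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_attention(src, tgt):
    """Hypothetical sketch: one affinity matrix shared by both directions.

    src: word-level style vectors for language A (list of equal-length lists)
    tgt: word-level style vectors for language B
    Returns (a_to_b, b_to_a): a context vector for every word in tgt built
    from src, and a context vector for every word in src built from tgt.
    """
    dim = len(src[0])
    scale = math.sqrt(dim)
    # Shared affinity matrix: scores[i][j] = <src[i], tgt[j]> / sqrt(dim).
    scores = [[sum(a * b for a, b in zip(s, t)) / scale for t in tgt]
              for s in src]

    # A -> B: each target word attends over all source words (column softmax).
    a_to_b = []
    for j in range(len(tgt)):
        w = softmax([scores[i][j] for i in range(len(src))])
        a_to_b.append([sum(w[i] * src[i][d] for i in range(len(src)))
                       for d in range(dim)])

    # B -> A: each source word attends over all target words (row softmax).
    b_to_a = []
    for i in range(len(src)):
        w = softmax(scores[i])
        b_to_a.append([sum(w[j] * tgt[j][d] for j in range(len(tgt)))
                       for d in range(dim)])
    return a_to_b, b_to_a


# Toy example: 3 source words and 2 target words with 2-dimensional styles.
src_styles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_styles = [[0.5, 0.5], [2.0, 0.0]]
a_to_b, b_to_a = bidirectional_attention(src_styles, tgt_styles)
```

Because both directions reuse the same score matrix, the alignment learned for one transfer direction constrains the other, which is one plausible reading of "shared" in the abstract. A real system would use learned projections and trained encoders rather than raw dot products.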

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Unsupervised Multi-scale Expressive Speaking Style Modeling with Hierarchical Context Information for Audiobook Speech Synthesis

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Towards Multi-Scale Style Control for Expressive Speech Synthesis

Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis

Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis

Inferring Speaking Styles from Multi-modal Conversational Context by Multi-scale Relational Graph Convolutional Networks

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis

Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

Prosodic Modeling with Rich Syntactic Context in HMM-based Mandarin Speech Synthesis

Parsing Hierarchical Prosodic Structure For Mandarin Speech Synthesis

Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin

Joint Multi-scale Cross-lingual Speaking Style Transfer with Bidirectional Attention Mechanism for Automatic Dubbing

A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis

Style Modeling for Multi-Speaker Articulation-to-Speech

Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition