Abstract: Human speech can be characterized by different components, including semantic content, speaker identity, and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging research question, both because different attributes such as timbre and rhythm are intrinsically associated and because supervised training schemes are needed to achieve robust, large-scale, speaker-independent ASR. This paper addresses the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement, and integrate three crucial components in our proposed speech reconstruction model, Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model that generates speaker identity embeddings, and (3) a trainable prosody encoder that learns prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracy) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial to the performance of widely used speech pretraining models, and that combining Prosody2Vec with HuBERT representations surpasses state-of-the-art methods.
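
The abstract reports weighted and unweighted accuracy without defining them. As a minimal sketch, assuming the convention common in the SER literature (weighted accuracy is the overall fraction of correct predictions, unweighted accuracy is the mean of the per-class recalls), the two metrics could be computed as follows. The function names and the toy labels are illustrative only and do not come from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = defaultdict(int)  # samples per true class
    hits = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: an imbalanced set with 6 "neutral" and 2 "angry" samples.
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 8 samples are correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 6/6 and 1/2
```

On imbalanced data the two numbers diverge: here the majority class dominates weighted accuracy, while unweighted accuracy penalizes the missed minority-class sample more heavily. This is presumably why both are reported.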

PE-wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Discourse-Level Prosody Modeling with a Variational Autoencoder for Non-Autoregressive Expressive Speech Synthesis

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

Rep2wav: Noise Robust text-to-speech Using self-supervised representations

Improving Naturalness and Controllability of Sequence-to-Sequence Speech Synthesis by Learning Local Prosody Representations

Speech BERT Embedding For Improving Prosody in Neural TTS

Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning