Abstract: Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging research question because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracies) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec complement widely used speech pretraining models and improve their performance, surpassing state-of-the-art methods when Prosody2Vec is combined with HuBERT representations.
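
The abstract reports weighted and unweighted accuracies without defining them. The sketch below assumes the convention common in SER work: weighted accuracy (WA) is the overall fraction of correctly classified utterances, and unweighted accuracy (UA) is the mean of per-class recalls, so that rare emotion classes count as much as frequent ones. The function names and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly predicted utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, made-up labels with a class imbalance (6 neutral, 2 angry, 2 sad).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 10 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0, 0.5, 0.0
```

On imbalanced data the two measures can diverge sharply, as in the toy labels here, which is one reason both are usually reported together.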
