Abstract: Effective emotion inference from user queries helps Voice Dialogue Applications (VDAs) give more personified responses. The vast number of VDA users brings diverse emotion expressions. How can we achieve high emotion inference performance on large-scale internet voice data in VDAs? Traditionally, research on speech emotion recognition has been based on acted voice datasets, which have few speakers but strong and clear emotion expressions. Inspired by this, we propose a novel approach that leverages acted voice data with strong emotion expressions to enhance emotion inference on large-scale unlabeled internet voice data with diverse emotion expressions. Specifically, we propose a novel semi-supervised multi-modal curriculum augmentation deep learning framework. First, to learn more general emotion cues, we adopt a curriculum learning based epoch-wise training strategy, which first trains our model on strong and balanced emotion samples from acted voice data and subsequently leverages weak and unbalanced emotion samples from internet voice data. Second, to exploit more diverse emotion expressions, we design a Multi-path Mix-match Multimodal Deep Neural Network (MMMD), which effectively learns feature representations for multiple modalities and trains on labeled and unlabeled data with hybrid semi-supervised methods for superior generalization and robustness. Experiments on an internet voice dataset with 500,000 utterances show that our method outperforms several alternative baselines (+10.09% in terms of F1), while an acted corpus with 2,397 utterances contributes 4.35%. To further compare our method with state-of-the-art techniques on traditional acted voice datasets, we also conduct experiments on the public IEMOCAP dataset. The results demonstrate the effectiveness of the proposed approach.
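The epoch-wise curriculum described in the abstract (acted data first, internet data later) can be illustrated with a minimal sketch. The abstract does not give the actual schedule, so everything here is an assumption made for illustration: the function names `curriculum_mix` and `build_batch`, the `warmup_epochs` parameter, the linear ramp, and the 0.5 cap on the internet-data fraction are all hypothetical. The sketch covers only how samples might be scheduled per epoch, not the MMMD network or its semi-supervised losses.

```python
import random


def curriculum_mix(epoch, warmup_epochs, max_internet_fraction=0.5):
    """Fraction of a batch drawn from internet voice data at a given epoch.

    Hypothetical schedule: acted data only during warm-up, then a linear
    ramp toward max_internet_fraction. The paper's real schedule may differ.
    """
    if epoch < warmup_epochs:
        return 0.0
    ramp = (epoch - warmup_epochs + 1) / warmup_epochs
    return min(max_internet_fraction, ramp * max_internet_fraction)


def build_batch(acted, internet, epoch, batch_size, warmup_epochs, rng):
    """Sample one batch whose composition follows the curriculum schedule."""
    frac = curriculum_mix(epoch, warmup_epochs)
    n_internet = int(round(frac * batch_size))
    n_acted = batch_size - n_internet
    # Sampling with replacement keeps the sketch short; a real loader
    # would more likely iterate over shuffled datasets.
    batch = rng.choices(acted, k=n_acted) + rng.choices(internet, k=n_internet)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    acted = [("acted", i) for i in range(10)]        # strong, balanced samples
    internet = [("internet", i) for i in range(10)]  # weak, unbalanced samples
    for epoch in range(6):
        batch = build_batch(acted, internet, epoch, 8, 3, rng)
        n_internet = sum(1 for source, _ in batch if source == "internet")
        print(epoch, n_internet)
```

Under this assumed schedule, early epochs see only acted samples and later epochs mix in internet samples up to the cap. In the framework described above, the internet samples would additionally pass through the semi-supervised branch, since they are unlabeled.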

Improve Emotional Speech Synthesis Quality by Learning Explicit and Implicit Representations with Semi-Supervised Training

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Emotion Controllable Speech Synthesis Using Emotion-Unlabeled Dataset with the Assistance of Cross-Domain Speech Emotion Recognition

Fine-Grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Prosody Analysis And Modeling For Emotional Speech Synthesis

ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation Based Deep Learning Approach

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning

Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition

Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition

Speech Synthesis with Mixed Emotions

Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation

Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling