Abstract: Effective emotion inference from user queries helps Voice Dialogue Applications (VDAs) give more personified responses. The tremendous number of VDA users brings diverse emotion expressions. How can we achieve high emotion inference performance on large-scale internet voice data in VDAs? Traditionally, research on speech emotion recognition has been based on acted voice datasets, which have few speakers but strong and clear emotion expressions. Inspired by this, we propose a novel approach that leverages acted voice data with strong emotion expressions to enhance emotion inference on large-scale unlabeled internet voice data with diverse emotion expressions. Specifically, we propose a novel semi-supervised multi-modal curriculum augmentation deep learning framework. First, to learn more general emotion cues, we adopt a curriculum learning based epoch-wise training strategy, which first trains our model guided by strong and balanced emotion samples from acted voice data and subsequently leverages weak and unbalanced emotion samples from internet voice data. Second, to employ more diverse emotion expressions, we design a Multi-path Mix-match Multimodal Deep Neural Network (MMMD), which effectively learns feature representations for multiple modalities and trains on labeled and unlabeled data with hybrid semi-supervised methods for superior generalization and robustness. Experiments on an internet voice dataset with 500,000 utterances show that our method outperforms several alternative baselines (+10.09% in terms of F1), while an acted corpus with 2,397 utterances contributes 4.35%. To further compare our method with state-of-the-art techniques on traditional acted voice datasets, we also conduct experiments on the public dataset IEMOCAP. The results reveal the effectiveness of the proposed approach.
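
The epoch-wise curriculum described in the abstract can be illustrated with a minimal Python sketch. It is only one possible reading, not the authors' implementation: the function name `curriculum_batches`, the `warmup_epochs` switch point, and the simple pooling of the two sources are all assumptions made for illustration. The MMMD network and the semi-supervised losses are not modeled here.

```python
import random


def curriculum_batches(acted, internet, num_epochs, warmup_epochs, batch_size, seed=0):
    """Yield (epoch, batch) pairs following a simple two-stage curriculum.

    Assumption (not from the paper): during the first `warmup_epochs` epochs
    only acted samples are used; afterwards acted and internet samples are
    pooled and shuffled together.
    """
    rng = random.Random(seed)
    for epoch in range(num_epochs):
        if epoch < warmup_epochs:
            pool = list(acted)                   # strong, balanced samples only
        else:
            pool = list(acted) + list(internet)  # add weak, unbalanced samples
        rng.shuffle(pool)
        for start in range(0, len(pool), batch_size):
            yield epoch, pool[start:start + batch_size]


if __name__ == "__main__":
    # Hypothetical toy data: each sample is a (source, index) pair.
    acted = [("acted", i) for i in range(4)]
    internet = [("internet", i) for i in range(6)]
    for epoch, batch in curriculum_batches(acted, internet, 3, 1, batch_size=4):
        print(epoch, batch)
```

In a real training loop each yielded batch would be passed to the model's update step; how the internet samples are weighted or pseudo-labeled is left open here because the abstract does not specify it.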

Active Learning For Dimensional Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech

Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network

Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks

Continuous Metric Learning For Transferable Speech Emotion Recognition and Embedding Across Low-resource Languages

Long Short Term Memory Recurrent Neural Network Based Multimodal Dimensional Emotion Recognition

Speech Emotion Recognition Based on Linear Discriminant Analysis and Support Vector Machine Decision Tree

Investigating salient representations and label Variance in Dimensional Speech Emotion Analysis

Articulation constrained learning with application to speech emotion recognition

An Exploration of Active Learning for Affective Digital Phenotyping

Unsupervised Feature Learning for Speech Emotion Recognition Based on Autoencoder

Dimensional Emotion Detection from Categorical Emotion

Improving Speaker-independent Speech Emotion Recognition Using Dynamic Joint Distribution Adaptation

Speech Emotion Recognition with Emotion-Pair Based Framework Considering Emotion Distribution Information in Dimensional Emotion Space

Automatic Emotion Variation Detection in Continuous Speech

Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation Based Deep Learning Approach

A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition

Learning Utterance-level Representations with Label Smoothing for Speech Emotion Recognition