Abstract: Effective emotion inference from user queries helps Voice Dialogue Applications (VDAs) give more personified responses. The very large number of VDA users brings diverse emotion expressions, which raises the question of how to achieve high emotion-inference performance on large-scale internet voice data in VDAs. Traditionally, research on speech emotion recognition has been based on acted voice datasets, which have few speakers but strong and clear emotion expressions. Inspired by this, we propose a novel approach that leverages acted voice data with strong emotion expressions to improve emotion inference on large-scale unlabeled internet voice data with diverse emotion expressions. Specifically, we propose a novel semi-supervised multi-modal curriculum augmentation deep learning framework. First, to learn more general emotion cues, we adopt a curriculum-learning-based epoch-wise training strategy, which first trains our model on strong and balanced emotion samples from acted voice data and subsequently leverages weak and unbalanced emotion samples from internet voice data. Second, to exploit more diverse emotion expressions, we design a Multi-path Mix-match Multimodal Deep Neural Network (MMMD), which effectively learns feature representations for multiple modalities and trains on labeled and unlabeled data with hybrid semi-supervised methods for superior generalization and robustness. Experiments on an internet voice dataset with 500,000 utterances show that our method outperforms several alternative baselines (+10.09% in terms of F1), while an acted corpus with 2,397 utterances contributes 4.35%. To further compare our method with state-of-the-art techniques on traditional acted voice datasets, we also conduct experiments on the public dataset IEMOCAP. The results reveal the effectiveness of the proposed approach.
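
The epoch-wise curriculum idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `internet_fraction` and `epoch_pool`, the warm-up and ramp lengths, and the linear ramp itself are all assumptions made for illustration. The sketch only shows one plausible way to start training on acted samples and then gradually admit internet samples as epochs progress.

```python
import random


def internet_fraction(epoch, warmup_epochs=3, ramp_epochs=5):
    """Share of internet samples admitted at a given epoch (hypothetical schedule).

    Assumption: a warm-up on acted data only, followed by a linear ramp to 1.0.
    """
    if epoch < warmup_epochs:
        return 0.0
    return min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)


def epoch_pool(acted, internet, epoch, seed=0):
    """Build the training pool for one epoch.

    All acted samples are always kept; a random subset of the internet
    samples is added, sized according to internet_fraction(epoch).
    """
    rng = random.Random(seed + epoch)  # deterministic per epoch for reproducibility
    k = round(internet_fraction(epoch) * len(internet))
    pool = list(acted) + rng.sample(list(internet), k)
    rng.shuffle(pool)
    return pool


if __name__ == "__main__":
    acted = [f"acted_{i}" for i in range(4)]
    internet = [f"web_{i}" for i in range(10)]
    for epoch in range(9):
        pool = epoch_pool(acted, internet, epoch)
        print(epoch, internet_fraction(epoch), len(pool))
```

In a real pipeline the pool would feed a data loader, and the schedule would more likely be driven by sample difficulty or label confidence than by a fixed linear ramp; the fixed ramp is used here only to keep the sketch short.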

Inferring Users' Emotions For Human-Mobile Voice Dialogue Applications

Inferring Emotions from Large-Scale Internet Voice Data.

Emotion Inferring from Large-scale Internet Voice Data: A Multimodal Deep Learning Approach

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation Based Deep Learning Approach

Inferring Emotion from Conversational Voice Data: A Semi-Supervised Multi-Path Generative Neural Network Approach.

Deep Spectrum Feature Representations for Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Learning to Infer Public Emotions from Large-Scale Networked Voice Data

Inferring Emphasis for Real Voice Data: an Attentive Multimodal Neural Network Approach.

Inferring User Emotive State Changes in Realistic Human-Computer Conversational Dialogs.

Emphasis Detection for Voice Dialogue Applications Using Multi-channel Convolutional Bidirectional Long Short-Term Memory Network

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Emotion Detection from Speech to Enrich Multimedia Content

The Acoustically Emotion-Aware Conversational Agent with Speech Emotion Recognition and Empathetic Responses

Acoustics, Content and Geo-Information Based Sentiment Prediction from Large-Scale Networked Voice Data

Deep Learning and SVM-based Emotion Recognition from Chinese Speech for Smart Affective Services

Enhancing the Perceived Emotional Intelligence of Conversational Agents Through Acoustic Cues.

Research on Chinese Speech Emotion Recognition Based on Deep Neural Network and Acoustic Features

Wav2vec2.0 and Context Emotional Information Compensation Based Dialogue Speech Emotion Recognition

Affective Voice Interaction and Artificial Intelligence: A Research Study on the Acoustic Features of Gender and the Emotional States of the PAD Model