Emotion Inferring from Large-scale Internet Voice Data: A Multimodal Deep Learning Approach

Suping Zhou, Jia Jia, Yanfeng Wang, Wei Chen, Fanbo Meng, Ya Li, Jianhua Tao
DOI: https://doi.org/10.1109/ACIIAsia.2018.8470311
2018-01-01
Abstract: Voice Dialogue Applications (VDAs) are increasingly popular. Because the same sentence expressed with different emotions may convey different meanings, inferring emotion from users' queries can help VDAs give more humanized responses. However, large-scale Internet voice data, which involves a tremendous number of users, brings great diversity in users' dialects and expression preferences. As a result, traditional speech emotion recognition methods, which mainly target acted corpora, cannot handle such massive and diverse data effectively. In this paper, we propose a semi-supervised Emotion-oriented Bimodal Deep Autoencoder (EBDA) to infer emotion from large-scale Internet voice data. Specifically, whereas previous research mainly focuses on acoustic features alone, we use EBDA to integrate both acoustic and textual features. Meanwhile, we adopt a semi-supervised strategy so that large-scale unlabeled data can be used to enhance classification performance. In experiments on 6 emotion categories, using a dataset of 7.5 million utterances collected from Sogou Voice Assistant, our method outperforms several alternative baselines (+0.18% in terms of F1 on average). Finally, we show some interesting case studies to further demonstrate the practicability of our model.