Audio Visual Recognition of Spontaneous Emotions In-the-Wild.
Xiaohan Xia, Liyong Guo, Dongmei Jiang, Ercheng Pei, Le Yang, Hichem Sahli
DOI: https://doi.org/10.1007/978-981-10-3005-5_57
2016-01-01
Abstract: In this paper, we target the CCPR 2016 Multimodal Emotion Recognition Challenge (MEC 2016), which is based on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) of movies and TV programs showing (nearly) spontaneous human emotions. Low-level descriptors (LLDs) are proposed as audio features. As visual features, we propose using histograms of oriented gradients (HOG), local phase quantisation (LPQ), shape features, and behavior-related features such as head pose and eye gaze. The visual features are post-processed to delete or smooth the all-zero feature vector segments. Single-modal emotion recognition is performed using fully connected hidden Markov models (HMMs). For multimodal emotion recognition, two schemes are proposed. In the first scheme, the normalized probability vectors from the HMMs are input to a support vector machine (SVM) for final recognition. In the second scheme, the final emotion is estimated using either audio or video features, depending on whether the face has been detected in the full video. Moreover, to make full use of the labeled data and to overcome the problem of unbalanced data, we use the training set and validation set together to train the HMMs and SVMs, with parameters optimized via cross-validation experiments. Experimental results on the test set show that the macro average precisions (MAPs) of audio, visual, and multimodal emotion recognition reach 42.85%, 54.24%, and 53.90%, respectively, which are much higher than the corresponding baseline results of 24.02%, 34.28%, and 30.63%.
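The two multimodal schemes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the softmax normalization of per-class HMM log-likelihoods is an assumption (the abstract only says the probability vectors are normalized), and the SVM itself is omitted, so the first scheme is shown only up to the feature vector that such a classifier would receive.

```python
import math

def normalize_scores(log_likelihoods):
    """Turn per-class HMM log-likelihoods into a probability vector.

    Softmax is an assumed normalization; the abstract does not specify one.
    """
    m = max(log_likelihoods)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in log_likelihoods]
    total = sum(exps)
    return [e / total for e in exps]

def fusion_features(audio_ll, video_ll):
    """Scheme 1 (sketch): concatenate the normalized audio and video
    probability vectors into one input vector for a downstream classifier."""
    return normalize_scores(audio_ll) + normalize_scores(video_ll)

def select_modality(audio_ll, video_ll, face_detected):
    """Scheme 2 (sketch): decide from the video scores when a face was
    detected, otherwise fall back to the audio scores.

    Returns the index of the highest-scoring emotion class.
    """
    scores = video_ll if face_detected else audio_ll
    return max(range(len(scores)), key=scores.__getitem__)

# Toy log-likelihoods for three hypothetical emotion classes.
audio = [-10.0, -12.0, -11.0]
video = [-30.0, -25.0, -28.0]
features = fusion_features(audio, video)
choice_with_face = select_modality(audio, video, face_detected=True)
choice_without_face = select_modality(audio, video, face_detected=False)
```

In this toy example the concatenated vector has six entries (three per modality), and the second scheme picks the video-based class when a face is available and the audio-based class otherwise.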