Abstract: In this paper, we present our system for the video emotion recognition task of the Multimodal Emotion Challenge (MEC 2017). Histogram of Oriented Gradients (HOG), face shape (SHAPE), and geometric (GEO) features are extracted from the detected face images as hand-crafted video features. A pre-trained VGG-Face model is fine-tuned with the face images and emotion labels from the training set of CHEAVD 2.0, and the outputs of the penultimate fully-connected layer (FC6) and the last fully-connected layer (FC7) are adopted as Deep Convolutional Neural Network (DCNN) based features. For each video clip, each hand-crafted or DCNN-based feature type is fed into its own set of hidden Markov models (HMMs, one per emotion class) for initial emotion recognition. The log-likelihoods output by the HMMs are then ranked, and the resulting rank orders form an eight-dimensional feature vector that is input to a Naive Bayes classifier for decision fusion. Experimental results on the CHEAVD 2.0 database show that the combination of FC6, GEO, SHAPE, and HOG features obtains the highest macro average precision (MAP) on both the validation set (46.61%) and the test set (43.88%), which are 12.51% and 22.18% higher than the baseline results, respectively.
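The ranking step that precedes decision fusion can be illustrated with a minimal sketch. It assumes that rank 1 denotes the class whose HMM gives the highest log-likelihood and that ties are broken by class index; the abstract specifies neither convention. The function name `likelihood_ranks` and the score values are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def likelihood_ranks(log_likelihoods: Sequence[float]) -> List[int]:
    """Return the rank of each class score (assumed: 1 = highest log-likelihood)."""
    # Sort class indices by descending score. Python's sort is stable,
    # so tied scores keep their original index order (an assumption).
    order = sorted(range(len(log_likelihoods)), key=lambda i: -log_likelihoods[i])
    ranks = [0] * len(log_likelihoods)
    for position, class_index in enumerate(order, start=1):
        ranks[class_index] = position
    return ranks


# Hypothetical log-likelihoods from eight per-emotion HMMs for one clip.
scores = [-120.5, -98.2, -101.7, -150.0, -99.9, -130.3, -110.4, -105.6]
rank_vector = likelihood_ranks(scores)
print(rank_vector)  # [6, 1, 3, 8, 2, 7, 5, 4]
```

Under these assumptions, each feature type yields one such eight-dimensional rank vector per clip, which would then serve as input to the Naive Bayes classifier.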

Video Emotion Recognition using Hand-Crafted and Deep Learning Features