Learning Affective Features with a Hybrid Deep Model for Audio–Visual Emotion Recognition

Shiqing Zhang, Shiliang Zhang, Tiejun Huang, Wen Gao, Qi Tian
DOI: https://doi.org/10.1109/tcsvt.2017.2719043
IF: 5.859
2018-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Emotion recognition is challenging due to the emotional gap between emotions and audio-visual features. Motivated by the powerful feature learning ability of deep neural networks, this paper proposes to bridge the emotional gap with a hybrid deep model, which first produces audio-visual segment features with a Convolutional Neural Network (CNN) and a 3D-CNN, then fuses those segment features in a Deep Belief Network (DBN). The proposed method is trained in two stages. First, CNN and 3D-CNN models pre-trained on corresponding large-scale image and video classification tasks are fine-tuned on emotion recognition tasks to learn audio and visual segment features, respectively. Second, the outputs of the CNN and 3D-CNN models are combined in a fusion network built with a DBN model. The fusion network is trained to jointly learn a discriminative audio-visual segment feature representation. After the segment features learned by the DBN are average-pooled into a fixed-length global video feature, a linear Support Vector Machine is used for video emotion classification. Experimental results on three public audio-visual emotional databases, namely the acted RML database, the acted eNTERFACE05 database, and the spontaneous BAUM-1s database, demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues with CNN, 3D-CNN, and DBN for audio-visual emotion recognition.
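The average-pooling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each video has already been reduced to a list of equal-length segment feature vectors (in the paper these would come from the DBN fusion network); the function name `average_pool_segments` and the toy values are hypothetical and are not taken from the paper. The CNN, 3D-CNN, DBN, and linear SVM stages are not shown.

```python
from typing import List, Sequence


def average_pool_segments(segment_features: Sequence[Sequence[float]]) -> List[float]:
    """Average a variable number of segment vectors into one fixed-length vector.

    Hypothetical helper: each inner sequence stands in for one segment-level
    feature vector; the result has the same dimensionality as a single segment.
    """
    if not segment_features:
        raise ValueError("need at least one segment")
    dim = len(segment_features[0])
    if any(len(seg) != dim for seg in segment_features):
        raise ValueError("all segment vectors must have the same length")
    n = len(segment_features)
    # Element-wise mean over the segment axis.
    return [sum(seg[i] for seg in segment_features) / n for i in range(dim)]


# Two toy "videos" with different numbers of segments (made-up values).
video_a = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
video_b = [[0.0, 0.0, 6.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

global_a = average_pool_segments(video_a)
global_b = average_pool_segments(video_b)
print(len(global_a), len(global_b))
```

Both videos yield a vector of the same length regardless of how many segments they contain, which is what allows a single classifier with a fixed input size to be applied at the video level.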