Multimodal emotion recognition based on deep neural network

Jiayin Ye, Wenming Zheng, Yang Li, Youyi Cai, Zhen Cui
DOI: https://doi.org/10.3969/j.issn.1003-7985.2017.04.009
2017-01-01
Abstract: To improve the accuracy of emotion recognition from voice and video, a convolutional neural network (CNN) is combined with a recurrent neural network (RNN) to encode and integrate the two information sources. For the audio signals, several frequency bands and energy functions are extracted as low-level features using a sophisticated audio technique, and these are then encoded with a one-dimensional (1D) CNN to abstract high-level features. The high-level features are finally fed into an RNN to capture dynamic tone changes along the temporal dimension. In contrast, for the video signals, a two-dimensional (2D) CNN and a similar RNN are used to capture dynamic changes in facial appearance over temporal sequences. The method was evaluated on the Chinese Natural Audio-Visual Emotion Database used in the Chinese Conference on Pattern Recognition (CCPR) in 2016. Experimental results show that the average classification precision of the proposed method is 41.15%, an increase of 16.62% over the baseline algorithm provided by the CCPR in 2016. These results indicate that the proposed method identifies emotional information more accurately.
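The audio branch described in the abstract (a 1D convolution over low-level frame features, followed by a recurrent network over time) can be sketched roughly as below. This is a minimal pure-Python illustration and not the authors' implementation: the function names, the ReLU nonlinearity, the Elman-style recurrence, and all dimensions are assumptions, since the abstract does not specify layer sizes, activation functions, or the RNN variant. The video branch would follow the same pattern with a 2D convolution applied to each face frame.

```python
import math


def conv1d_relu(frames, kernels):
    """Valid 1D convolution over the time axis, followed by ReLU.

    frames:  list of T low-level feature vectors, each of length D.
    kernels: list of K kernels; each kernel is a list of W weight
             vectors of length D (W = kernel width in frames).
    Returns a list of T - W + 1 high-level vectors, each of length K.
    """
    width = len(kernels[0])
    out = []
    for t in range(len(frames) - width + 1):
        window = frames[t:t + width]
        row = []
        for kernel in kernels:
            s = sum(w * x
                    for w_vec, frame in zip(kernel, window)
                    for w, x in zip(w_vec, frame))
            row.append(max(0.0, s))  # ReLU (assumed activation)
        out.append(row)
    return out


def rnn_last_state(seq, w_in, w_rec):
    """Simple Elman RNN: h_t = tanh(W_in x_t + W_rec h_{t-1}).

    Returns the final hidden state, used here as a summary of the
    temporal dynamics of the whole sequence.
    """
    h = [0.0] * len(w_rec)
    for x in seq:
        h = [math.tanh(sum(a * b for a, b in zip(w_in[i], x)) +
                       sum(a * b for a, b in zip(w_rec[i], h)))
             for i in range(len(h))]
    return h


def audio_branch(frames, kernels, w_in, w_rec):
    """Hypothetical audio branch: 1D CNN encoder, then RNN over time."""
    return rnn_last_state(conv1d_relu(frames, kernels), w_in, w_rec)
```

In a full system, the final hidden states of the audio and video branches would be combined and passed to a classifier over the emotion categories; how the two streams are fused is not stated in the abstract, so it is left out of this sketch.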