Visual Scene-aware Hybrid Neural Network Architecture for Video-based Facial Expression Recognition
D. Kim,B. Song,Min Kyu Lee,D. Choi
DOI: https://doi.org/10.1109/FG.2019.8756551
2019-05-14
Abstract:With rapid development of deep learning, facial expression recognition (FER) technology has made considerable progress recently. However, since conventional FER techniques are mainly designed and learned for videos which are artificially acquired in a limited environment, they may not operate robustly on videos acquired in a wild environment. To solve this problem, this paper proposes a scene-aware hybrid neural network (NN) having a novel combination of three-dimensional (3D) convolutional NN (CNN), 2D CNN and recurrent NN (RNN). The characteristics of the proposed network are as follows. First, we extract video-based global features and frame-based local features at the same time. In detail, the latent features containing the overall visual scene of a given video are extracted by 3D CNN with auxiliary classifier, and fine-tuned 2D CNN is adopted to extract latent features containing small details from each frame. Second, RNN not only performs temporal domain learning, but also feature-wise fuses two latent features extracted from the networks. For effective fusion, we also present three RNN schemes. Third, the proposed network, in which the above-mentioned methods collaborate, works very robust in a wild environment as well as in a limited environment. Extensive experiments show that the proposed network provides an average accuracy of 49.9% for AFEW dataset, i.e., a representative wild dataset, and an amazing accuracy of 98.2% for another CK+ dataset. We also show that the proposed network outperforms the state-of-the-art network(s).
Computer Science