Audio-Visual Emotion Recognition with Capsule-like Feature Representation and Model-Based Reinforcement Learning

Xi Ouyang, Srikanth Nagisetty, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, Dong-Yan Huang
DOI: https://doi.org/10.1109/aciiasia.2018.8470316
2018-01-01
Abstract: This paper presents the techniques used in our contribution to the Multimodal Emotion Recognition Challenge (MEC 2017). The purpose of the challenge is to classify eight basic emotions (happy, sad, angry, worried, anxious, surprise, disgust and neutral) in the Chinese Natural Audio-Visual Emotion Database (CHEAVD) 2.0, whose clips are selected from Chinese movies and TV programs. As facial expressions are caused by the movement of facial features such as the mouth and eyebrows, a capsule-like feature representation is proposed to capture not only the existence of static facial emotions in video frames but also their instantiation parameters. To further improve emotion classification accuracy, a model-based reinforcement learning method is proposed for audio-visual fusion, which exploits feedback from submissions on the challenge test dataset as rewards to learn the fusion model. The overall accuracy of the proposed approach on the test dataset is 52.3% and the macro average precision is 39.7%. This performance places in the top two of the MEC 2017 audio-visual sub-challenge.
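The two reported metrics can diverge sharply on an imbalanced label set, which is why both are quoted. The following is a minimal sketch of how they are commonly computed, not the challenge's official scoring script. The function names, the toy labels, and the choice to score a never-predicted class as 0.0 are all assumptions made for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_average_precision(y_true, y_pred, labels):
    """Unweighted mean of per-class precision over the given labels.

    Assumption: a class that is never predicted contributes 0.0.
    """
    predicted = Counter(y_pred)
    true_pos = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    per_class = [
        true_pos[c] / predicted[c] if predicted[c] else 0.0
        for c in labels
    ]
    return sum(per_class) / len(per_class)


# Toy data (hypothetical, not taken from CHEAVD 2.0).
labels = ["happy", "sad", "angry"]
y_true = ["happy", "happy", "happy", "sad", "angry", "happy"]
y_pred = ["happy", "happy", "happy", "happy", "angry", "sad"]

acc = accuracy(y_true, y_pred)
map_score = macro_average_precision(y_true, y_pred, labels)
print(f"accuracy={acc:.3f}, macro average precision={map_score:.3f}")
```

Because every class carries equal weight in the macro average, a frequent class that is classified well raises accuracy more than it raises macro average precision, which is consistent with the gap between the two figures reported above.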