Context-Aware Based Visual-Audio Feature Fusion for Emotion Recognition

Huijie Cheng, Yun Tie, Lin Qi, Cong Jin
DOI: https://doi.org/10.1109/IJCNN52387.2021.9533473
2021-01-01
Abstract: Video emotion recognition is a significant branch of affective computing. However, traditional approaches focus mainly on human features and ignore the contextual clues provided by video scenes and objects. In this work, we propose a context-aware framework for bi-modal video emotion recognition. Unlike existing methods that extract features directly from the entire video frame, we extract key frames and key regions of videos to obtain the emotional cues contained in video scenes and objects. Specifically, for the visual stream, a hierarchical Bidirectional Long Short-Term Memory (Bi-LSTM) network is applied to summarize video scenes and find the key frames that contribute most to video emotion. Meanwhile, we introduce a Region Proposal Network (RPN) to extract the features of object regions in video frames and construct an emotional similarity graph. After a Feedforward Neural Network (FNN) assigns different weight coefficients to different regions, a Graph Convolutional Network (GCN) is used to reason about the connections between key regions. Moreover, the context information of frame-level Log-Mel spectrum fragments supplements the visual information. Finally, we fuse the visual and acoustic features with an adaptive gated multimodal fusion module for video emotion classification. We conduct experiments on the Video Emotion-8 and Ekman-6 datasets. The experimental results demonstrate that our model achieves better classification accuracy than several baseline models.
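The abstract does not give the equations of the adaptive gated multimodal fusion module, so the following is only a minimal sketch of one common gating formulation, not the authors' implementation. It assumes a learned gate z = sigmoid(W_z [v; a]) that blends tanh-projected visual and audio features element-wise. All names (`gated_fusion`, `W_v`, `W_a`, `W_z`, `random_matrix`) and the toy dimensions are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def gated_fusion(visual, audio, W_v, W_a, W_z):
    """Blend two modality vectors with an element-wise gate (assumed form).

    h_v = tanh(W_v v), h_a = tanh(W_a a), z = sigmoid(W_z [v; a])
    fused = z * h_v + (1 - z) * h_a
    """
    h_v = [math.tanh(t) for t in matvec(W_v, visual)]
    h_a = [math.tanh(t) for t in matvec(W_a, audio)]
    # The gate sees both modalities concatenated into one vector.
    z = [sigmoid(t) for t in matvec(W_z, visual + audio)]
    return [zi * hv + (1.0 - zi) * ha for zi, hv, ha in zip(z, h_v, h_a)]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
visual = [0.2, -0.1, 0.4]  # toy visual feature vector
audio = [0.3, 0.5]         # toy audio (Log-Mel derived) feature vector
hidden = 4                 # size of the fused representation

W_v = random_matrix(hidden, len(visual), rng)
W_a = random_matrix(hidden, len(audio), rng)
W_z = random_matrix(hidden, len(visual) + len(audio), rng)

fused = gated_fusion(visual, audio, W_v, W_a, W_z)
print(fused)
```

In this formulation each fused component is a convex combination of the two modality projections, so the gate can lean on audio when the visual cue is weak and vice versa. In a full model the fused vector would feed a classifier over the emotion categories.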