Learning facial expression and body gesture visual information for video emotion recognition

Jie Wei, Guanyu Hu, Xinyu Yang, Anh Tuan Luu, Yizhuo Dong

DOI: https://doi.org/10.1016/j.eswa.2023.121419

IF: 8.5

2024-03-01

Expert Systems with Applications

Abstract: Recent research has shown that facial expressions and body gestures are two significant cues for identifying human emotions. However, these studies mainly focus on contextual information of adjacent frames, and rarely explore the spatio-temporal relationships between distant or global frames. In this paper, we revisit the facial expression and body gesture emotion recognition problems, and propose to improve the performance of video emotion recognition by extracting spatio-temporal features via further encoding of temporal information. Specifically, for facial expressions, we propose a super image-based spatio-temporal convolutional model (SISTCM) and a two-stream LSTM model to capture local spatio-temporal features and learn global temporal cues of emotion changes. For body gestures, a novel representation method and an attention-based channel-wise convolutional model (ACCM) are introduced to learn key joint features and the independent characteristics of each joint. Extensive experiments on five common datasets demonstrate the superiority of the proposed method, and the results show that learning both kinds of visual information leads to significant improvement over existing state-of-the-art methods.

Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic; Operations Research & Management Science

What problem does this paper attempt to address?

The paper addresses how to make effective use of two kinds of visual information, facial expressions and body gestures, in video emotion recognition, and in particular how to model their spatio-temporal relationships across distant or global frames. Existing research mainly focuses on the contextual information of adjacent frames and rarely explores relationships between long-distance or global frames. The authors therefore propose to improve video emotion recognition by further encoding temporal information when extracting spatio-temporal features. For facial expressions, the paper proposes a super image-based spatio-temporal convolutional model (SISTCM) and a two-stream LSTM model, which capture local spatio-temporal features and learn global temporal cues of emotion changes. For body gestures, it introduces a new representation method and an attention-based channel-wise convolutional model (ACCM), which learn key joint features and the independent characteristics of each joint. Extensive experiments on five common datasets are reported to show the superiority of the proposed method, and the results indicate that learning both types of visual information yields a significant improvement over existing state-of-the-art methods.
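As a rough illustration of what a "super image" input can look like, the sketch below tiles a short sequence of video frames into a single 2D grid, so that an ordinary 2D convolution would see spatial and temporal neighbours at once. This is only a minimal sketch under stated assumptions: the summary above does not describe how SISTCM actually builds its super images, so the helper name `make_super_image`, the row-major layout, and the zero padding of unused grid cells are all hypothetical choices made for illustration.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


def make_super_image(frames: List[Frame], grid_cols: int) -> Frame:
    """Tile equally sized frames row-major into one large 2D image.

    Hypothetical helper: the row-major layout and zero padding are
    assumptions for illustration, not details taken from the paper.
    """
    if not frames or grid_cols <= 0:
        raise ValueError("need at least one frame and a positive grid width")
    height, width = len(frames[0]), len(frames[0][0])
    if any(len(f) != height or any(len(r) != width for r in f) for f in frames):
        raise ValueError("all frames must share the same height and width")

    grid_rows = -(-len(frames) // grid_cols)  # ceiling division
    blank = [[0.0] * width for _ in range(height)]
    # Pad with blank frames so the last grid row is complete.
    padded = frames + [blank] * (grid_rows * grid_cols - len(frames))

    super_image: Frame = []
    for r in range(grid_rows):
        row_frames = padded[r * grid_cols:(r + 1) * grid_cols]
        for y in range(height):
            line: List[float] = []
            for f in row_frames:
                line.extend(f[y])  # concatenate the y-th pixel row of each frame
            super_image.append(line)
    return super_image


# Four 2x2 frames, each filled with its own index, tiled into a 2x2 grid.
frames = [[[float(i)] * 2 for _ in range(2)] for i in range(4)]
tiled = make_super_image(frames, grid_cols=2)
```

Here `tiled` is a 4x4 image whose top half holds frames 0 and 1 side by side and whose bottom half holds frames 2 and 3. A model in the spirit of SISTCM would presumably apply convolutions to such a grid, but that step is outside what this summary specifies.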

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Emotion Recognition via Environmental Context and Human Body

Beyond Facial Expressions: Learning Human Emotion from Body Gestures

Emotion Recognition From Full-Body Motion Using Multiscale Spatio-Temporal Network

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Learning Expression Features via Deep Residual Attention Networks for Facial Expression Recognition From Video Sequences

An Ensemble Approach for Facial Expression Analysis in Video

Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Coarse-to-Fine Cascaded Networks with Smooth Predicting for Video Facial Expression Recognition

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

SAANet: Siamese Action-Units Attention Network for Improving Dynamic Facial Expression Recognition

Multimodal interaction enhanced representation learning for video emotion recognition

Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Facial Micro-Expression Recognition Based on Multi-Scale Temporal and Spatial Features

Transfer Spatio-Temporal Knowledge from Emotion-Related Tasks for Facial Expression Spotting

Survey of deep emotion recognition in dynamic data using facial, speech and textual cues

Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks

Automatic Recognition of Facial Displays of Unfelt Emotions

End-to-End Continuous Emotion Recognition from Video Using 3D ConvLSTM Networks

Learning Dynamics for Video Facial Expression Recognition