Abstract: Human emotion is an important indicator of mental states such as satisfaction or stress, and recognizing emotion from different media is essential for sequence analysis and for applications such as mental health assessment, job stress level estimation, and tourist satisfaction assessment. Emotion recognition based on computer vision, which detects emotion from visual media (e.g., images or videos) of human behavior using plentiful emotional cues, has been extensively investigated because of its significant applications. However, most existing models neglect inter-feature interaction and fuse features by simple concatenation, so they fail to capture the crucial complementary gains between face and context information in video clips, which are significant for addressing emotion confusion and emotion misunderstanding. Accordingly, to fully exploit the complementary information between face and context features, we present a novel cross-attention and hybrid feature weighting network for accurate emotion recognition from large-scale video clips. The proposed model consists of a dual-branch encoding (DBE) network, a hierarchical-attention encoding (HAE) network, and a deep fusion (DF) block. Specifically, the face and context encoding blocks in the DBE network generate the respective shallow features. The HAE network then uses a cross-attention (CA) block to capture the complementarity between facial expression features and their contexts via a cross-channel attention operation. An element recalibration (ER) block is introduced to revise the feature map of each channel by embedding global information. Moreover, the adaptive-attention (AA) block in the HAE network is developed to infer the optimal feature fusion weights and obtain adaptive emotion features via a hybrid feature weighting operation. Finally, the DF block integrates these adaptive emotion features to predict an individual's emotional state. Extensive experimental results on the CAER-S dataset demonstrate the effectiveness of our method and show its potential for analyzing tourist reviews with video clips, estimating job stress levels from visual emotional evidence, and assessing mental health from visual media.
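The abstract contrasts simple concatenation with fusion weights inferred from the features themselves. The sketch below illustrates that general idea only and is not the authors' AA block: the function `adaptive_fuse`, the scoring vectors `w_face` and `w_context` (stand-ins for learned parameters), and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(face, context, w_face, w_context):
    """Fuse two equal-length feature vectors with input-dependent weights.

    w_face and w_context are hypothetical scoring vectors; in a trained
    network they would be learned, here they are fixed by hand.
    """
    # One scalar relevance score per branch (dot product with its scoring vector).
    s_face = sum(f * w for f, w in zip(face, w_face))
    s_ctx = sum(c * w for c, w in zip(context, w_context))
    # Normalize the two scores so the fusion weights sum to 1.
    a_face, a_ctx = softmax([s_face, s_ctx])
    # Weighted sum instead of concatenation: output keeps the input dimension.
    fused = [a_face * f + a_ctx * c for f, c in zip(face, context)]
    return fused, (a_face, a_ctx)


# Toy features (assumed values, not taken from the paper).
face = [0.9, 0.1, 0.4]
context = [0.2, 0.8, 0.5]
fused, weights = adaptive_fuse(face, context, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(weights, fused)
```

Unlike concatenation, the weights here change with every input, so a clip with an uninformative face crop can lean more on its context features, and vice versa.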

Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

An Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Long Short Term Memory Recurrent Neural Network Based Multimodal Dimensional Emotion Recognition

A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Dimensional emotion recognition using visual and textual cues

A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos

An Improved Multimodal Dimension Emotion Recognition Based on Different Fusion Methods

Emotion Recognition from Large-Scale Video Clips with Cross-Attention and Hybrid Feature Weighting Neural Networks

Multi Task Sequence Learning for Depression Scale Prediction from Video

Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction

Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention

Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023

A Deep Bidirectional Long Short-Term Memory Based Multi-Scale Approach for Music Dynamic Emotion Prediction

Time-Delay Neural Network for Continuous Emotional Dimension Prediction from Facial Expression Sequences

Efficient Modeling of Long Temporal Contexts for Continuous Emotion Recognition