Abstract: Video-based Emotional Reaction Intensity (ERI) estimation measures the intensity of a subject's reactions to stimuli along several emotional dimensions, using videos of the subject recorded as they view the stimuli. We propose a multi-modal architecture for video-based ERI estimation that combines video and audio information. Video input is first encoded spatially, frame by frame, combining features that encode holistic aspects of the subject's facial expressions with features that encode spatially localized aspects of those expressions. The per-frame features are then combined across time: from frame to frame using gated recurrent units (GRUs), then globally by a transformer. We handle variable video length with a regression token that accumulates information from all frames into a fixed-dimensional vector independent of video length. Audio information is handled similarly: spectral information extracted within each frame is integrated across time by a cascade of GRUs and a transformer with a regression token. The outputs of the video and audio regression tokens are merged by concatenation, then input to a final fully connected layer that produces the intensity estimates. Our architecture achieved strong performance on the Hume-Reaction dataset in the ERI Estimation Challenge of the Fifth Competition on Affective Behavior Analysis in-the-Wild (ABAW5). The Pearson Correlation Coefficients between estimated and subject self-reported scores, averaged across all emotions, were 0.455 on the validation dataset and 0.4547 on the test dataset, well above the baselines. The transformer's self-attention mechanism enables our architecture to focus on the most critical video frames regardless of video length. Ablation experiments establish the advantages of combining holistic and local features and of multi-modal integration. Code is available at https://github.com/HKUST-NISL/ABAW5.
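The evaluation metric named in the abstract, the Pearson Correlation Coefficient averaged across emotions, can be illustrated with a minimal sketch. This is not taken from the authors' repository: the function names `pearson` and `mean_pearson`, the row-per-video data layout, and the toy scores are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_pearson(predicted, reported):
    """Average the per-emotion correlation over all emotion dimensions.

    Assumed layout (hypothetical): one row per video, one column per emotion.
    """
    n_emotions = len(predicted[0])
    per_emotion = [
        pearson([row[k] for row in predicted], [row[k] for row in reported])
        for k in range(n_emotions)
    ]
    return sum(per_emotion) / n_emotions


# Toy data (made up): three videos, two emotion dimensions.
predicted = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
reported = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
score = mean_pearson(predicted, reported)
```

In this toy case the predictions are a perfect linear function of the self-reported scores in both dimensions, so the averaged coefficient is 1 up to floating-point error. Correlating each emotion separately before averaging means that a dimension with a small score range counts as much as one with a large range.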
