Abstract: Video-based Emotional Reaction Intensity (ERI) estimation measures the intensity of subjects' reactions to stimuli along several emotional dimensions from videos of the subjects as they view the stimuli. We propose a multi-modal architecture for video-based ERI combining video and audio information. Video input is encoded spatially first, frame-by-frame, combining features encoding holistic aspects of the subjects' facial expressions and features encoding spatially localized aspects of their expressions. Input is then combined across time: from frame to frame using gated recurrent units (GRUs), then globally by a transformer. We handle variable video length with a regression token that accumulates information from all frames into a fixed-dimensional vector independent of video length. Audio information is handled similarly: spectral information extracted within each frame is integrated across time by a cascade of GRUs and a transformer with regression token. The video and audio regression tokens' outputs are merged by concatenation, then input to a final fully connected layer producing intensity estimates. Our architecture achieved excellent performance on the Hume-Reaction dataset in the ERI Estimation Challenge of the Fifth Competition on Affective Behavior Analysis in-the-Wild (ABAW5). The Pearson Correlation Coefficients between estimated and subject self-reported scores, averaged across all emotions, were 0.455 on the validation dataset and 0.4547 on the test dataset, well above the baselines. The transformer's self-attention mechanism enables our architecture to focus on the most critical video frames regardless of length. Ablation experiments establish the advantages of combining holistic/local features and of multi-modal integration. Code available at https://github.com/HKUST-NISL/ABAW5.
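
The evaluation metric named in the abstract, the Pearson Correlation Coefficient averaged across emotions, can be sketched as follows. This is a minimal illustration and not the challenge's official evaluation code: the function names `pearson` and `mean_pearson` are hypothetical, and the layout (one row per video, one column per emotion) is an assumption made for the example.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance and variances, all left unnormalized: the 1/n factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_pearson(pred: Sequence[Sequence[float]],
                 target: Sequence[Sequence[float]]) -> float:
    """Average the per-emotion correlation.

    Assumed layout (hypothetical): pred[i][k] is the estimated intensity of
    emotion k for video i, and target[i][k] is the self-reported score.
    """
    n_emotions = len(pred[0])
    scores = []
    for k in range(n_emotions):
        est = [row[k] for row in pred]
        ref = [row[k] for row in target]
        scores.append(pearson(est, ref))
    return sum(scores) / n_emotions
```

Because the correlation is computed separately for each emotion before averaging, an emotion whose scores span a wider range does not dominate the final number.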

Temporal Enhancement for Video Affective Content Analysis

Representation Learning Through Multimodal Attention and Time-Sync Comments for Affective Video Content Analysis

TFAE: temporal feature adjustable enhancement for video anomaly detection

TEINet: Towards an Efficient Architecture for Video Recognition.

Exploiting EEG signals and audiovisual feature fusion for video emotion recognition

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers

TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition

Affective Video Classification Based on Spatio-temporal Feature Fusion

Sentiment Analysis on Online Videos by Time-Sync Comments

Temporal Context Aggregation for Video Retrieval with Contrastive Learning

Enhancing Multimodal Affective Analysis with Learned Live Comment Features

Affective Video Content Analysis: Decade Review and New Perspectives

Transformer Encoder With Multi-Modal Multi-Head Attention for Continuous Affect Recognition

Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition

Video Classification and Recommendation Based on Affective Analysis of Viewers

Affective Behaviour Analysis via Integrating Multi-Modal Knowledge

Enhancing Multimodal Emotional Information Extraction in Film and Television through Adaptive Feature Fusion with DenseNet, Transformer, and 3D CNN Models

Video affective content analysis: a survey of state of the art methods

Integrating Holistic and Local Information to Estimate Emotional Reaction Intensity

Spatio-temporal feature learning for enhancing video quality based on screen content characteristics