Abstract: Human emotion is an important indicator of mental state, e.g., satisfaction or stress, and recognizing emotion from different media is essential for sequence analysis and for applications such as mental health assessment, job stress level estimation, and tourist satisfaction assessment. Emotion recognition based on computer vision, which detects emotion from visual media (e.g., images or videos) of human behavior using the rich emotional cues they contain, has been extensively investigated because of these applications. However, most existing models neglect inter-feature interaction and fuse features by simple concatenation, so they fail to capture the complementary gains between face and context information in video clips, which are important for resolving emotion confusion and emotion misunderstanding. Accordingly, to fully exploit the complementary information between face and context features, we present a novel cross-attention and hybrid feature weighting network for accurate emotion recognition from large-scale video clips. The proposed model consists of a dual-branch encoding (DBE) network, a hierarchical-attention encoding (HAE) network, and a deep fusion (DF) block. Specifically, the face and context encoding blocks in the DBE network generate the respective shallow features. The HAE network then uses a cross-attention (CA) block to capture the complementarity between facial expression features and their contexts via a cross-channel attention operation. An element recalibration (ER) block is introduced to revise the feature map of each channel by embedding global information.
Moreover, the adaptive-attention (AA) block in the HAE network infers the optimal feature fusion weights and obtains the adaptive emotion features via a hybrid feature weighting operation. Finally, the DF block integrates these adaptive emotion features to predict an individual's emotional state. Extensive experimental results on the CAER-S dataset demonstrate the effectiveness of our method, showing its potential for the analysis of tourist reviews with video clips, the estimation of job stress levels from visual emotional evidence, and the assessment of mental health from visual media.
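
The abstract names three operations (cross-channel attention, element recalibration, and hybrid feature weighting) without giving their equations, so the following is only a minimal NumPy sketch of how such blocks could look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the (channels, positions) feature layout, the scaled dot-product form of the attention, the squeeze-and-excitation-style gate standing in for the ER block, and the random weights `w1`, `w2`, and `v`.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_channel_attention(face, context):
    # face, context: (C, N) arrays, C channels by N flattened spatial positions.
    # Assumed form: each face channel attends over the context channels.
    scores = face @ context.T / np.sqrt(face.shape[1])  # (C, C) channel affinities
    attn = softmax(scores, axis=1)                       # each row sums to 1
    return face + attn @ context, attn                   # residual update of face features


def recalibrate(x, w1, w2):
    # Squeeze-and-excitation-style gate, used here as a stand-in for the ER block.
    squeezed = x.mean(axis=1)                            # global average per channel, (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)              # ReLU bottleneck, (R,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))          # sigmoid gate per channel, (C,)
    return x * gate[:, None], gate


def adaptive_fusion(face_vec, context_vec, v):
    # One scalar score per branch, normalized so the two fusion weights sum to 1.
    scores = np.array([v @ face_vec, v @ context_vec])
    weights = softmax(scores)
    fused = np.concatenate([weights[0] * face_vec, weights[1] * context_vec])
    return fused, weights


# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
C, N, R = 8, 16, 4                       # channels, positions, bottleneck width
face = rng.standard_normal((C, N))
context = rng.standard_normal((C, N))
w1 = rng.standard_normal((R, C))
w2 = rng.standard_normal((C, R))
v = rng.standard_normal(C)

face_att, attn = cross_channel_attention(face, context)
face_rec, gate = recalibrate(face_att, w1, w2)
context_rec, _ = recalibrate(context, w1, w2)
fused, weights = adaptive_fusion(face_rec.mean(axis=1), context_rec.mean(axis=1), v)
```

In this sketch the fused vector simply concatenates the two weighted branch descriptors; a classifier head, playing the role the abstract assigns to the DF block, would consume `fused`. Unlike plain concatenation, the weights depend on the input, so one branch can dominate when the other is uninformative.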

Video emotional description with fact reinforcement and emotion awaking

Creating Memorable Video Summaries That Satisfy the User's Intention for Taking the Videos.

Emotional Video Captioning With Vision-Based Emotion Interpretation Network

An Unsupervised Video Summarization Method Based on Multimodal Representation.

Learning topic emotion and logical semantic for video paragraph captioning

A Video Description Model with Improved Attention Mechanism

Video emotion analysis enhanced by recognizing emotion in video comments

Visual-Texual Emotion Analysis with Deep Coupled Video and Danmu Neural Networks

Multimodal interaction enhanced representation learning for video emotion recognition

Temporal Enhancement for Video Affective Content Analysis

Dual-path Collaborative Generation Network for Emotional Video Captioning

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

Decoding viewer emotions in video ads

Content-Based Video Emotion Tagging Augmented by Users’ Multiple Physiological Responses

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Emotion Recognition from Large-Scale Video Clips with Cross-Attention and Hybrid Feature Weighting Neural Networks

Use of Affective Visual Information for Summarization of Human-Centric Videos

Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and Baseline

Enhancing Multimodal Emotional Information Extraction in Film and Television through Adaptive Feature Fusion with DenseNet, Transformer, and 3D CNN Models

Jointly Modeling Embedding and Translation to Bridge Video and Language

StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models