Abstract: Human emotion is an important indicator of a person's mental state, e.g., satisfaction or stress, and recognizing or detecting emotion from different media is essential for sequence analysis and for applications such as mental health assessment, job stress level estimation, and tourist satisfaction assessment. Emotion recognition based on computer vision, which detects emotion from visual media (e.g., images or videos) of human behavior using plentiful emotional cues, has been extensively investigated because of these applications. However, most existing models neglect inter-feature interaction and fuse features by simple concatenation, so they fail to capture the complementary gains between face and context information in video clips, which are important for addressing emotion confusion and emotion misunderstanding. Accordingly, to fully exploit the complementary information between face and context features, we present a novel cross-attention and hybrid feature weighting network for accurate emotion recognition from large-scale video clips. The proposed model consists of a dual-branch encoding (DBE) network, a hierarchical-attention encoding (HAE) network, and a deep fusion (DF) block. Specifically, the face and context encoding blocks in the DBE network generate the respective shallow features. The HAE network then uses a cross-attention (CA) block to capture the complementarity between facial expression features and their contexts via a cross-channel attention operation. An element recalibration (ER) block revises the feature map of each channel by embedding global information. 
Moreover, the adaptive-attention (AA) block in the HAE network infers the optimal feature fusion weights and obtains adaptive emotion features via a hybrid feature weighting operation. Finally, the DF block integrates these adaptive emotion features to predict an individual's emotional state. Extensive experiments on the CAER-S dataset demonstrate the effectiveness of our method and its potential for analyzing tourist reviews with video clips, estimating job stress levels from visual emotional evidence, and assessing mental health from visual media.
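
The three HAE operations named above (cross-channel attention, element recalibration, and adaptive fusion weighting) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names are hypothetical, there are no learned parameters, the recalibration gate is assumed to follow a squeeze-and-excitation style (global average followed by a sigmoid), and the fusion scores are a placeholder for what would be a trained scoring layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_channel_attention(target, source):
    """Enrich each target channel with an attention-weighted mix of source channels.

    Both inputs are lists of channels; each channel is a flattened feature map.
    """
    scale = math.sqrt(len(target[0]))
    enriched = []
    for t_chan in target:
        # Similarity of this target channel to every source channel.
        scores = [sum(a * b for a, b in zip(t_chan, s_chan)) / scale
                  for s_chan in source]
        attn = softmax(scores)
        mixed = [sum(w * s_chan[k] for w, s_chan in zip(attn, source))
                 for k in range(len(t_chan))]
        # Residual connection keeps the original target information.
        enriched.append([t + m for t, m in zip(t_chan, mixed)])
    return enriched

def recalibrate(features):
    """Scale each channel by a sigmoid gate computed from its global average.

    Assumption: a parameter-free stand-in for a learned recalibration block.
    """
    out = []
    for chan in features:
        gate = 1.0 / (1.0 + math.exp(-sum(chan) / len(chan)))
        out.append([gate * v for v in chan])
    return out

def adaptive_fusion(face, context):
    """Fuse two branches with softmax weights; returns (fused_vector, weights)."""
    face_vec = [sum(c) / len(c) for c in face]
    ctx_vec = [sum(c) / len(c) for c in context]
    # Placeholder scoring: a trained model would use a learned layer here.
    weights = softmax([sum(face_vec) / len(face_vec),
                       sum(ctx_vec) / len(ctx_vec)])
    fused = [weights[0] * f + weights[1] * c for f, c in zip(face_vec, ctx_vec)]
    return fused, weights

# Toy inputs: 2 channels with 3 spatial positions per branch (made-up values).
face = [[0.2, 0.4, 0.1], [0.9, 0.3, 0.5]]
context = [[0.7, 0.1, 0.6], [0.2, 0.8, 0.4]]

face_att = recalibrate(cross_channel_attention(face, context))
ctx_att = recalibrate(cross_channel_attention(context, face))
fused, weights = adaptive_fusion(face_att, ctx_att)
print(len(fused), round(sum(weights), 6))
```

The point of the sketch is the contrast with plain concatenation: each branch first attends to the other's channels, and the final weights are computed from the features themselves rather than fixed in advance.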

Affective Video Content Analysis Via Multimodal Deep Quality Embedding Network

Multimodal Deep Denoise Framework for Affective Video Content Analysis.

A Multimodal Deep Regression Bayesian Network For Affective Video Content Analyses

Affective Video Content Analyses by Using Cross-Modal Embedding Learning Features

Affective Analysis for Video Frames Using ConvLSTM Network.

Multimodal Local-Global Attention Network for Affective Video Content Analysis

Affective Video Content Analysis with Adaptive Fusion Recurrent Network

Deep Video Quality Assessment Using Constrained Multi-Task Regression and Spatio-temporal Feature Fusion.

Visual-Texual Emotion Analysis with Deep Coupled Video and Danmu Neural Networks

Unified multi-stage fusion network for affective video content analysis

Deep Blind Video Quality Assessment for User Generated Videos.

EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks.

MVVA-Net: a Video Aesthetic Quality Assessment Network with Cognitive Fusion of Multi-type Feature–Based Strong Generalization

Knowledge-Augmented Multimodal Deep Regression Bayesian Networks for Emotion Video Tagging

Learning Affective Features with a Hybrid Deep Model for Audio–Visual Emotion Recognition

MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis

A Novel Affective Visualization System for Videos Based on Acoustic and Visual Features

Affection Driven Neural Networks for Sentiment Analysis.

Emotion Recognition from Large-Scale Video Clips with Cross-Attention and Hybrid Feature Weighting Neural Networks

Video Quality Assessment With Serial Dependence Modeling