Abstract: In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and on fusion techniques that combine the extracted features for predicting affect. Yet these studies typically ignore both the correlation between inputs from multiple modalities and the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture, AttendAffectNet (AAN), that uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. Specifically, visual, audio, and text features are used to predict emotions, expressed in terms of valence and arousal. We analyze three variants of our proposed AAN: Feature AAN, Temporal AAN, and Mixed AAN. The Feature AAN applies the self-attention mechanism in an innovative way to the features extracted from the different modalities (video, audio, and movie subtitles) of a whole movie, thereby capturing the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account: self-attention is applied to the concatenated (multimodal) feature vectors representing consecutive movie segments. The Mixed AAN combines the strengths of the Feature AAN and the Temporal AAN by applying self-attention first to the feature vectors obtained from the different modalities in each movie segment, and then to the feature representations of all consecutive (temporal) movie segments. We extensively trained and validated our proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset.
Our experiments demonstrate that audio features play a more influential role than features extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets. The models that use visual, audio, and text features simultaneously as their inputs performed better than those using features extracted from each modality separately. In addition, the Feature AAN outperformed the other AAN variants on the above-mentioned datasets, highlighting the importance of taking different features as context to one another when fusing them. The Feature AAN also performed better than the baseline models when predicting the valence dimension.
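
The difference between the three variants described in the abstract comes down to which vectors self-attention treats as its input sequence. The following is a minimal sketch of that idea, not the authors' implementation. It assumes plain scaled dot-product self-attention with identity query/key/value projections (no learned weights), modality vectors that already share one dimension, mean pooling after the feature-level attention, and no positional encoding or regression head. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    tokens: list of n vectors (lists of floats) of equal dimension d.
    Returns (outputs, weights); weights[i][j] is how much token i attends
    to token j. Assumption: no learned projection matrices.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


def mean_pool(vectors):
    """Average a list of equal-length vectors (an assumed pooling choice)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def feature_attention(visual, audio, text):
    """Feature-level sketch: the sequence is the set of modality vectors."""
    out, _ = self_attention([visual, audio, text])
    return mean_pool(out)


def temporal_attention(segments):
    """Temporal sketch: one token per segment, modalities concatenated."""
    tokens = [v + a + t for (v, a, t) in segments]
    out, _ = self_attention(tokens)
    return out


def mixed_attention(segments):
    """Mixed sketch: attend over modalities per segment, then over time."""
    fused = [feature_attention(v, a, t) for (v, a, t) in segments]
    out, _ = self_attention(fused)
    return out
```

With two toy segments, each a `(visual, audio, text)` triple of 2-dimensional vectors, `temporal_attention` returns one 6-dimensional vector per segment, whereas `mixed_attention` returns one 2-dimensional vector per segment, because the modalities are fused by attention before the temporal step rather than by concatenation.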

Representation Learning Through Multimodal Attention and Time-Sync Comments for Affective Video Content Analysis

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Temporal Enhancement for Video Affective Content Analysis

Affective Video Classification Based on Spatio-temporal Feature Fusion

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Temporal Multimodal Fusion for Video Emotion Classification in the Wild

A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos

Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation

Multimodal interaction enhanced representation learning for video emotion recognition

Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

Multimodal emotion recognition from facial expression and speech based on feature fusion

Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations

Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning

Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video

AMSA: Adaptive Multimodal Learning for Sentiment Analysis

Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention

AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention

Temporal Context Aggregation for Video Retrieval with Contrastive Learning

Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos