Abstract: In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and on fusion techniques that combine the extracted features for predicting affect. Yet these approaches typically ignore both the correlation between inputs from multiple modalities and the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture, AttendAffectNet (AAN), which uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are used to predict emotions, expressed in terms of valence and arousal. We analyze three variants of our proposed AAN: Feature AAN, Temporal AAN, and Mixed AAN. The Feature AAN applies self-attention to the features extracted from the different modalities (video, audio, and movie subtitles) of a whole movie in order to capture the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account: self-attention is applied to the concatenated (multimodal) feature vectors representing successive movie segments. The Mixed AAN combines the strong points of the Feature AAN and the Temporal AAN by applying self-attention first to the feature vectors obtained from the different modalities in each movie segment and then to the feature representations of all successive (temporal) movie segments. We extensively trained and validated our proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset.
Our experiments demonstrate that audio features play a more influential role than those extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets. The models that use all visual, audio, and text features simultaneously as their inputs performed better than those using features extracted from each modality separately. In addition, the Feature AAN outperformed other AAN variants on the above-mentioned datasets, highlighting the importance of taking different features as context to one another when fusing them. The Feature AAN also performed better than the baseline models when predicting the valence dimension.
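To make the difference between the three variants concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes scaled dot-product self-attention with identity query/key/value projections (the learned projections, multiple heads, and regression layers of a real model are omitted), assumes all modality vectors share one dimension, and uses mean pooling to summarise a segment. The function names `feature_attention`, `temporal_attention`, and `mixed_attention` are hypothetical labels chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption: queries, keys, and values are the inputs themselves
    (no learned projections), so each output is a weighted average of the inputs.
    """
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return outputs


def mean_pool(vectors):
    """Average a list of vectors into one vector (an assumed pooling choice)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def feature_attention(modalities):
    """Feature-style: attend across modality vectors, then pool."""
    return mean_pool(self_attention(modalities))


def temporal_attention(segments):
    """Temporal-style: concatenate modalities per segment, attend across segments."""
    concatenated = [[x for vec in seg for x in vec] for seg in segments]
    return self_attention(concatenated)


def mixed_attention(segments):
    """Mixed-style: attend across modalities within each segment, then across segments."""
    per_segment = [feature_attention(seg) for seg in segments]
    return self_attention(per_segment)


# Toy input: 3 segments, each with 3 modality vectors (visual, audio, text) of size 4.
toy_segments = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    for _ in range(3)
]
temporal = temporal_attention(toy_segments)
mixed = mixed_attention(toy_segments)
print(len(temporal), len(temporal[0]))  # 3 12
print(len(mixed), len(mixed[0]))        # 3 4
```

In this sketch the only structural difference between the variants is the axis along which attention is computed: across modalities, across time, or across modalities first and time second.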
