Abstract: This paper uses deep neural network models to predict the range of human emotions experienced while watching movies. In this setting, three distinct input modalities considerably influence the experienced emotions: visual cues derived from RGB video frames; auditory components comprising sounds, speech, and music; and linguistic elements comprising the actors' dialogue. Emotions are commonly described using a two-factor model consisting of valence (ranging from happy to sad) and arousal (indicating the intensity of the emotion). Many prior works have proposed models that predict valence and arousal from video content. However, none of these models incorporates all three modalities, and language is consistently omitted. In this study, we combine all three modalities and analyze the importance of each in predicting valence and arousal. We represent each input modality using pre-trained neural networks. To process visual input, we employ pre-trained convolutional neural networks that recognize scenes[1], objects[2], and actions[3,4]. For audio, we use SoundNet[5], a neural network designed for sound-related tasks. Finally, we use Bidirectional Encoder Representations from Transformers (BERT) models to extract linguistic features[6]. We report results on the COGNIMUSE dataset[7], where our proposed model outperforms the current state-of-the-art approaches. Surprisingly, our findings reveal that language significantly influences experienced arousal, while sound emerges as the primary determinant for predicting valence. In contrast, the visual modality has the least impact of all modalities in predicting emotions.
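The abstract describes combining features from three modalities and then measuring how much each one contributes. The following is a minimal sketch of that idea, not the authors' implementation: it assumes simple concatenation of per-modality feature vectors, a linear read-out for valence and arousal, and an ablation that zeroes out one modality at a time. The abstract does not state the fusion method or the prediction head, so `fuse`, `predict`, the feature sizes, and all numeric values below are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence

# Fixed concatenation order for the three modalities named in the abstract.
MODALITIES = ("visual", "audio", "language")


def fuse(features: Dict[str, Sequence[float]], drop: Sequence[str] = ()) -> List[float]:
    """Concatenate per-modality features; zero out any modality listed in `drop`."""
    fused: List[float] = []
    for name in MODALITIES:
        vector = features[name]
        if name in drop:
            # Ablation: keep the dimensionality, remove the information.
            fused.extend([0.0] * len(vector))
        else:
            fused.extend(vector)
    return fused


def predict(fused: Sequence[float], weights: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Linear read-out with one weight vector per target (valence, arousal)."""
    return {
        target: sum(w * x for w, x in zip(w_vec, fused))
        for target, w_vec in weights.items()
    }


# Toy stand-ins for pre-trained features (real feature vectors would be far larger).
features = {
    "visual": [0.2, 0.4],    # e.g. scene/object/action CNN features
    "audio": [1.0],          # e.g. SoundNet features
    "language": [0.5, 0.5],  # e.g. BERT features
}

# Made-up weights for illustration only; these are not learned values or results.
weights = {
    "valence": [0.1, 0.1, 0.5, 0.2, 0.2],
    "arousal": [0.0, 0.1, 0.1, 0.4, 0.4],
}

full = predict(fuse(features), weights)
no_audio = predict(fuse(features, drop=["audio"]), weights)

# The change in each prediction when a modality is removed gives a rough
# indication of how much that modality contributes under this toy model.
for target in weights:
    print(target, full[target], no_audio[target])
```

In practice, modality importance would be measured by retraining or evaluating on held-out data with each modality removed, rather than by comparing single predictions as this toy example does.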

mAnI: Movie Amalgamation using Neural Imitation

Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

Enhancing the Prediction of Emotional Experience in Movies using Deep Neural Networks: The Significance of Audio and Language

Learning Language-Visual Embedding for Movie Understanding with Natural-Language

ManimML: Communicating Machine Learning Architectures with Animation

TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation

Movienet: a movie multilayer network model using visual and textual semantic cues

NeuroCine: Decoding Vivid Video Sequences from Human Brain Activties

Video SemNet: Memory-Augmented Video Semantic Network

Audio-Visual Sentiment Analysis for Learning Emotional Arcs in Movies

MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

A Graph-Based Framework to Bridge Movies and Synopses

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Folksonomication: Predicting Tags for Movies from Plot Synopses Using Emotion Flow Encoded Neural Network

Enhanced movie content similarity based on textual, auditory and visual information

MindCeive: Perceiving human imagination using CNN-GRU and GANs

Moviescope: Large-scale Analysis of Movies using Multiple Modalities

Adversarial Multimodal Network for Movie Question Answering

MovieCLIP: Visual Scene Recognition in Movies

Neural model based collaborative filtering for movie recommendation system

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning