StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models

Yuxiang Guo, Faizan Siddiqui, Yang Zhao, Rama Chellappa, Shao-Yuan Lo

2024-08-31

Abstract: Predicting and reasoning how a video would make a human feel is crucial for developing socially intelligent systems. Although Multimodal Large Language Models (MLLMs) have shown impressive video understanding capabilities, they tend to focus more on the semantic content of videos, often overlooking emotional stimuli. Hence, most existing MLLMs fall short in estimating viewers' emotional reactions and providing plausible explanations. To address this issue, we propose StimuVAR, a spatiotemporal Stimuli-aware framework for Video Affective Reasoning (VAR) with MLLMs. StimuVAR incorporates a two-level stimuli-aware mechanism: frame-level awareness and token-level awareness. Frame-level awareness involves sampling video frames with events that are most likely to evoke viewers' emotions. Token-level awareness performs tube selection in the token space to make the MLLM concentrate on emotion-triggered spatiotemporal regions. Furthermore, we create VAR instruction data to perform affective training, steering MLLMs' reasoning strengths towards emotional focus and thereby enhancing their affective reasoning ability. To thoroughly assess the effectiveness of VAR, we provide a comprehensive evaluation protocol with extensive metrics. StimuVAR is the first MLLM-based method for viewer-centered VAR. Experiments demonstrate its superiority in understanding viewers' emotional responses to videos and providing coherent and insightful explanations.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper aims to address the issue of how to predict and reason about the emotional responses that videos may elicit from viewers. Specifically, the paper focuses on developing socially intelligent systems capable of understanding human emotional reactions to videos. Although existing Multimodal Large Language Models (MLLMs) perform well in video understanding, they often emphasize the semantic content of videos while neglecting emotional stimuli. Consequently, most existing MLLMs fall short in estimating viewers' emotional responses and providing reasonable explanations.

To overcome this problem, the authors propose StimuVAR, a spatiotemporal affective stimulus-aware framework based on MLLMs for Video Affective Reasoning (VAR). StimuVAR identifies spatiotemporal affective stimuli through the following two mechanisms:

1. **Frame-level Perception**: An event-driven frame sampling strategy is used to select video frames that are most likely to elicit emotional responses from viewers.
2. **Token-level Perception**: An emotion-triggered token selection strategy is employed to choose spatiotemporal regions in the token space that trigger emotions, allowing the MLLM to focus its attention.

Additionally, the authors created VAR instruction data for affective training to enhance the emotional reasoning capabilities of MLLMs. Experimental results demonstrate that StimuVAR excels in understanding and explaining viewers' emotional responses to videos.
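The two selection steps can be illustrated with a minimal Python sketch. The summary does not say how events or emotion-triggering regions are scored, so this sketch assumes a simple proxy: the amount of change between consecutive frames. The function names (`frame_change_scores`, `sample_event_frames`, `select_token_tubes`) and the list-based data layout are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]  # a flattened frame or a token feature (assumed layout)


def frame_change_scores(frames: List[Vector]) -> List[float]:
    """Mean absolute difference between each frame and its predecessor."""
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        scores.append(diff)
    return scores


def sample_event_frames(frames: List[Vector], num_frames: int) -> List[int]:
    """Frame-level step: keep the frames with the largest change score.

    Assumption: large inter-frame change stands in for an "event".
    Returns the chosen frame indices in temporal order.
    """
    scores = frame_change_scores(frames)
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_frames])


def select_token_tubes(tokens: List[List[Vector]], num_tubes: int) -> List[int]:
    """Token-level step: keep the patch positions whose tokens change most.

    tokens[t][p] is the feature of patch p in sampled frame t. A "tube" is
    one patch position followed across all frames. Assumption: summed
    temporal change of a tube stands in for emotional salience.
    Returns the chosen patch positions in ascending order.
    """
    num_patches = len(tokens[0])
    scores = []
    for p in range(num_patches):
        total = 0.0
        for t in range(1, len(tokens)):
            total += sum(abs(a - b) for a, b in zip(tokens[t][p], tokens[t - 1][p]))
        scores.append(total)
    ranked = sorted(range(num_patches), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:num_tubes])
```

In a real pipeline the frames would be decoded video and the tokens would come from a visual encoder; only the selected frames and tubes would then be passed to the MLLM, which reduces the visual input to the regions assumed to matter most.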

Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

ST-LLM: Large Language Models Are Effective Temporal Learners

MSEVA : A System for Multimodal Short Videos Emotion Visual Analysis

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Knowledge-Augmented Multimodal Deep Regression Bayesian Networks for Emotion Video Tagging

Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

Representation Learning Through Multimodal Attention and Time-Sync Comments for Affective Video Content Analysis

A Multimodal Deep Regression Bayesian Network For Affective Video Content Analyses

Exploring the Design Space of Visual Context Representation in Video MLLMs

Synchronous Prediction of Arousal and Valence Using LSTM Network for Affective Video Content Analysis

Versatile audio-visual learning for emotion recognition

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding