Abstract: Humans can perceive subtle emotions from various cues and contexts, even without hearing or seeing others. However, existing video datasets mainly focus on recognizing the emotions of speakers from complete modalities. In this work, we present the task of multimodal emotion reasoning in videos. Beyond directly recognizing emotions from the multimodal signals of target persons, this task requires a machine to reason about human emotions from the context and the surrounding world. To facilitate the study of this task, we introduce a new dataset, MEmoR, that provides fine-grained emotion annotations for both speakers and non-speakers. The videos in MEmoR are collected from TV shows that closely resemble real-life scenarios. In these videos, speakers may lack visual cues, while non-speakers never deliver audio-textual signals and are often visually inconspicuous. This modality-missing characteristic makes MEmoR a more practical yet challenging testbed for multimodal emotion reasoning. To support various reasoning behaviors, the MEmoR dataset provides both short-term contexts and external knowledge. We further propose an attention-based reasoning approach to model intra-personal emotion contexts, inter-personal emotion propagation, and the personalities of different individuals. Experimental results demonstrate that our proposed approach significantly outperforms related baselines. We isolate and analyze the validity of the different reasoning modules across various emotions of speakers and non-speakers. Finally, we outline several future research directions for multimodal emotion reasoning with MEmoR, aiming to empower high Emotional Quotient (EQ) in modern artificial intelligence systems. The code and dataset are released at https://github.com/sunlightsgy/MEmoR.

Korean Drama Scene Transcript Dataset for Emotion Recognition in Conversations

K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations

EAV: EEG-Audio-Video Dataset for Emotion Recognition in Conversational Contexts

Emotion Detection on TV Show Transcripts with Sequence-based Convolutional Neural Networks

M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database

EmotionLines: An Emotion Corpus of Multi-Party Conversations

EmoInHindi: A Multi-label Emotion and Intensity Annotated Dataset in Hindi for Emotion Recognition in Dialogues

Context Based Emotion Recognition Using EMOTIC Dataset

Acting Emotions: a comprehensive dataset of elicited emotions

E‐Speech: Development of a Dataset for Speech Emotion Recognition and Analysis

Fine-grained Emotion and Intent Learning in Movie Dialogues

Multimedia emotion prediction using movie script and spectrogram

PhyMER: Physiological Dataset for Multimodal Emotion Recognition With Personality as a Context

EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech

TamilEmo: Finegrained Emotion Detection Dataset for Tamil

MEmoR: A Dataset for Multimodal Emotion Reasoning in Videos

How you feelin'? Learning Emotions and Mental States in Movie Scenes

EEG Dataset for the Recognition of Different Emotions Induced in Voice-User Interaction

A corpus-based approach to classifying emotions using Korean linguistic features

Building and validation of a set of facial expression images to detect emotions: a transcultural study

Text and Sound-Based Feature Extraction and Speech Emotion Classification for Korean