Abstract: Emotion Recognition (ER) is the process of identifying human emotions from given data. Currently, the field relies heavily on facial expression recognition (FER) because facial expressions contain rich emotional cues. However, facial expressions do not always reflect genuine emotions precisely, so FER-based results may yield misleading ER. To understand and bridge this gap between FER and ER, we introduce eye behaviors as important emotional cues and create a new Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. Unlike existing multimodal ER datasets, the EMER dataset uses stimulus material to induce spontaneous emotions and integrates non-invasive eye behavior data, such as eye movements and eye fixation maps, with facial videos, aiming to capture natural and accurate human emotions. Notably, for the first time, we provide annotations for both ER and FER in the EMER dataset, enabling a comprehensive analysis that better illustrates the gap between the two tasks. Furthermore, we design a new EMERT architecture to concurrently enhance performance in both ER and FER by efficiently identifying and bridging the emotion gap between the two. Specifically, our EMERT employs modality-adversarial feature decoupling and a multi-task Transformer to augment the modeling of eye behaviors, thus providing an effective complement to facial expressions. In the experiments, we introduce seven multimodal benchmark protocols for a comprehensive evaluation of the EMER dataset. The results show that EMERT outperforms other state-of-the-art multimodal methods by a large margin, revealing the importance of modeling eye behaviors for robust ER.
To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance.

MEmoR: A Dataset for Multimodal Emotion Reasoning in Videos

Explainable Multimodal Emotion Reasoning: a Promising Way to Open-set Emotion Recognition

Bridging the Emotional Semantic Gap via Multimodal Relevance Estimation

A Multimodal Dataset for Mixed Emotion Recognition

Counterfactual Scenario-relevant Knowledge-enriched Multi-modal Emotion Reasoning

Personality-aware Human-centric Multimodal Reasoning: A New Task, Dataset and Baselines

Open-vocabulary Multimodal Emotion Recognition: Dataset, Metric, and Benchmark

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information

MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition

Smile upon the Face but Sadness in the Eyes: Emotion Recognition based on Facial Expressions and Eye Behaviors

EmoLLM: Multimodal Emotional Understanding Meets Large Language Models

AffectGPT: Dataset and Framework for Explainable Multimodal Emotion Recognition

Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset

Generative Emotion Cause Explanation in Multimodal Conversations

MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues

HEU Emotion: A Large-scale Database for Multi-modal Emotion Recognition in the Wild

Multimodal Emotion Recognition and Sentiment Analysis via Attention Enhanced Recurrent Model

Emotion-Aware Multimodal Fusion for Meme Emotion Detection