Smile upon the Face but Sadness in the Eyes: Emotion Recognition based on Facial Expressions and Eye Behaviors

Yuanyuan Liu, Lin Wei, Kejun Liu, Yibing Zhan, Zijing Chen, Zhe Chen, Shiguang Shan
2024-11-08
Abstract: Emotion Recognition (ER) is the process of identifying human emotions from given data. Currently, the field heavily relies on facial expression recognition (FER) because facial expressions contain rich emotional cues. However, it is important to note that facial expressions may not always precisely reflect genuine emotions and FER-based results may yield misleading ER. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue for the creation of a new Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. Different from existing multimodal ER datasets, the EMER dataset employs a stimulus material-induced spontaneous emotion generation method to integrate non-invasive eye behavior data, like eye movements and eye fixation maps, with facial videos, aiming to obtain natural and accurate human emotions. Notably, for the first time, we provide annotations for both ER and FER in the EMER, enabling a comprehensive analysis to better illustrate the gap between both tasks. Furthermore, we specifically design a new EMERT architecture to concurrently enhance performance in both ER and FER by efficiently identifying and bridging the emotion gap between the two. Specifically, our EMERT employs modality-adversarial feature decoupling and multi-task Transformer to augment the modeling of eye behaviors, thus providing an effective complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER.
To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the over-reliance of existing Emotion Recognition (ER) methods on Facial Expression Recognition (FER). Facial expressions may not accurately reflect an individual's true emotions, so FER-based results can be misleading. Specifically, the paper points out:

1. **The gap between facial expressions and true emotions**: Facial expressions can be faked or concealed, so they do not always reflect an individual's emotional state. For example, a person may smile while hiding sadness.
2. **Limitations of existing datasets**: Most existing multimodal emotion databases rely on invasive electroencephalogram (EEG) data or other physiological signals. These data are costly to collect and invasive to participants, which limits the size and application scenarios of the datasets.

To solve these problems, the paper introduces eye-movement behavior as a new emotional cue and creates a new multimodal emotion recognition dataset (EMER). This dataset combines facial expression videos, eye-movement sequences, and eye-fixation maps, aiming to capture human emotions more naturally and accurately. The paper also designs a new Eye-behavior-aided Multimodal Emotion Recognition Transformer (EMERT) architecture to better bridge the gap between facial expressions and true emotions.

### Main contributions

1. **Creation of a new multimodal emotion recognition dataset (EMER)**: The dataset contains 1,303 high-quality multimodal samples from 121 participants. It records facial expressions and eye-movement behaviors and supports a comprehensive analysis of both FER and ER labels.
2. **Provision of comprehensive annotation information**: The annotations include coarse-grained (positive, negative, neutral) and fine-grained (happy, sad, fear, surprise, disgust, anger, neutral) emotion labels, as well as continuous emotion scores (valence and arousal).
3. **Design of a new EMERT architecture**: Through adversarial learning and a multi-task Transformer, the architecture explicitly extracts high-level emotion features that are sensitive to modalities, effectively bridging the gap between facial expressions and true emotions.
4. **Comprehensive evaluation of multimodal methods**: The paper evaluates multiple multimodal methods on the EMER dataset and introduces seven benchmark protocols for comprehensive evaluation, further demonstrating the importance of eye-movement behavior in emotion recognition.

### Formula examples

The following formulas are mentioned in the paper:

- **Weighted voting process**:

  \[ f_j = \frac{\sum_{i=1}^{5} \alpha_i t_j^i}{\sum_{i=1}^{5} \alpha_i} \]

  where \(t_j^i \in T_j\), and \(\alpha_i\) represents label reliability.

- **Emotion score range**: Valence and arousal scores both lie in \([-1, 1]\):

  \[ \text{Valence} \in [-1, 1], \quad \text{Arousal} \in [-1, 1] \]

Through these contributions, the paper aims to advance emotion recognition research, in particular by bridging the gap between facial expression recognition and true emotion recognition, in order to achieve more robust emotion recognition performance.
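The weighted voting formula listed above can be illustrated with a minimal Python sketch. It assumes that \(t_j^i\) is the \(i\)-th of five numeric labels collected for sample \(j\) and that \(\alpha_i\) is a non-negative reliability weight for that label. The function name `weighted_vote` and the example ratings and weights are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Sequence


def weighted_vote(labels: Sequence[float], reliabilities: Sequence[float]) -> float:
    """Compute f_j = sum(alpha_i * t_j^i) / sum(alpha_i) for one sample j."""
    if len(labels) != len(reliabilities):
        raise ValueError("labels and reliabilities must have the same length")
    total_weight = sum(reliabilities)
    if total_weight <= 0:
        raise ValueError("reliabilities must sum to a positive value")
    weighted_sum = sum(a * t for a, t in zip(reliabilities, labels))
    return weighted_sum / total_weight


# Hypothetical valence labels in [-1, 1] for one sample, with made-up
# reliability weights (assumption: one weight per label source).
example_labels = [0.6, 0.4, 0.5, -0.2, 0.7]
example_alphas = [1.0, 0.8, 0.9, 0.3, 1.0]

fused_valence = weighted_vote(example_labels, example_alphas)
print(f"fused valence: {fused_valence:.4f}")
```

Because the result is a convex combination of the input labels whenever the weights are non-negative, the fused score stays within the same \([-1, 1]\) range as the individual valence or arousal labels, and a label with low reliability (such as the outlier `-0.2` here) has proportionally less influence.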