Fusion in Context: A Multimodal Approach to Affective State Recognition

Youssef Mohamed, Severin Lemaignan, Arzu Guneysu, Patric Jensfelt, Christian Smith
2024-09-18
Abstract: Accurate recognition of human emotions is a crucial challenge in affective computing and human-robot interaction (HRI). Emotional states play a vital role in shaping behaviors, decisions, and social interactions. However, emotional expressions can be influenced by contextual factors, leading to misinterpretations if context is not considered. Multimodal fusion, combining modalities like facial expressions, speech, and physiological signals, has shown promise in improving affect recognition. This paper proposes a transformer-based multimodal fusion approach that leverages facial thermal data, facial action units, and textual context information for context-aware emotion recognition. We explore modality-specific encoders to learn tailored representations, which are then fused using additive fusion and processed by a shared transformer encoder to capture temporal dependencies and interactions. The proposed method is evaluated on a dataset collected from participants engaged in a tangible tabletop Pacman game designed to induce various affective states. Our results demonstrate the effectiveness of incorporating contextual information and multimodal fusion for affective state recognition.
Robotics
What problem does this paper attempt to address?
The paper addresses the accurate recognition of human affective states in affective computing and human-robot interaction (HRI). It proposes a transformer-based multimodal fusion method that combines facial thermal data, facial action units (AUs), and textual context information to achieve context-aware emotion recognition. Modality-specific encoders learn a tailored representation for each input; these representations are combined by additive fusion and passed to a shared transformer encoder that captures temporal dependencies and interactions. The motivation is that emotional expressions are shaped by context and can be misinterpreted when context is ignored, a limitation that unimodal approaches do not address. The method was evaluated on a dataset collected from participants playing a tangible tabletop Pacman game designed to induce various affective states. The reported results indicate that incorporating contextual information and fusing modalities improves affective state recognition compared with using a single modality alone.
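
To make the described pipeline concrete, the following is a minimal sketch of modality-specific encoding, additive fusion, and a shared self-attention step over time, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the feature sizes (5 thermal values, 17 action units, an 8-dimensional context embedding), the model width, the random weights, the single attention head, and all function names are hypothetical, and the textual context is assumed to have already been converted to a fixed-size vector per time step. A full transformer encoder would also include multi-head attention, feed-forward layers, normalization, and positional information, which are omitted here.

```python
import math
import random


def init_weights(d_in, d_out, rng):
    """Random (d_in x d_out) weight matrix; stands in for learned parameters."""
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.uniform(-scale, scale) for _ in range(d_out)] for _ in range(d_in)]


def linear(x, w):
    """Project vector x (length d_in) with weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_fusion(*encoded):
    """Element-wise sum of the per-modality embeddings at each time step."""
    return [[sum(vals) for vals in zip(*step)] for step in zip(*encoded)]


def self_attention(seq, wq, wk, wv):
    """Single-head scaled dot-product self-attention across time steps."""
    q = [linear(x, wq) for x in seq]
    k = [linear(x, wk) for x in seq]
    v = [linear(x, wv) for x in seq]
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    return out


def fuse_and_encode(thermal, action_units, context, params):
    """Encode each modality, fuse by addition, then apply shared attention."""
    encoded = [
        [linear(x, params["thermal"]) for x in thermal],
        [linear(x, params["au"]) for x in action_units],
        [linear(x, params["context"]) for x in context],
    ]
    fused = additive_fusion(*encoded)
    attended = self_attention(fused, params["wq"], params["wk"], params["wv"])
    # Residual connection, as in a standard transformer block.
    return [[f + a for f, a in zip(fs, at)] for fs, at in zip(fused, attended)]


def mean_pool(seq):
    """Average over time to obtain one vector per sequence."""
    return [sum(col) / len(seq) for col in zip(*seq)]


# Hypothetical sizes chosen only for this example.
T, D_THERMAL, D_AU, D_CONTEXT, D_MODEL = 3, 5, 17, 8, 4
rng = random.Random(0)
params = {
    "thermal": init_weights(D_THERMAL, D_MODEL, rng),
    "au": init_weights(D_AU, D_MODEL, rng),
    "context": init_weights(D_CONTEXT, D_MODEL, rng),
    "wq": init_weights(D_MODEL, D_MODEL, rng),
    "wk": init_weights(D_MODEL, D_MODEL, rng),
    "wv": init_weights(D_MODEL, D_MODEL, rng),
}
thermal = [[rng.random() for _ in range(D_THERMAL)] for _ in range(T)]
action_units = [[rng.random() for _ in range(D_AU)] for _ in range(T)]
context = [[rng.random() for _ in range(D_CONTEXT)] for _ in range(T)]

encoded = fuse_and_encode(thermal, action_units, context, params)
pooled = mean_pool(encoded)
print(len(encoded), len(pooled))
```

Additive fusion requires every modality to be projected to the same width first, which is the role the modality-specific encoders play in this sketch. The pooled vector would then feed a classifier over affective states, which is left out because the source does not specify the label set.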