Dynamic Modality and View Selection for Multimodal Emotion Recognition with Missing Modalities

Luciana Trinkaus Menon, Luiz Carlos Ribeiro Neduziak, Jean Paul Barddal, Alessandro Lameiras Koerich, Alceu de Souza Britto Jr
2024-04-18
Abstract: The study of human emotions, traditionally a cornerstone in fields like psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). Multiple channels, such as speech (voice) and facial expressions (image), are crucial in understanding human emotions. However, AI's journey in multimodal emotion recognition (MER) is marked by substantial technical challenges. One significant hurdle is how AI models manage the absence of a particular modality, a frequent occurrence in real-world situations. This study's central focus is assessing the performance and resilience of two strategies when one modality is missing: a novel multimodal dynamic modality and view selection method and a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for MER. In the missing-modality scenarios, all dynamic selection-based methods outperformed the baseline. The study concludes by emphasizing the intricate interplay between audio and video modalities in emotion prediction, showcasing the adaptability of dynamic selection methods in handling missing modalities.
Machine Learning, Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper primarily explores how to handle missing modalities in Multimodal Emotion Recognition (MER). Specifically:

1. **Proposing a new dynamic modality and view selection method**: The paper introduces a method based on dynamic modality and view selection to improve MER performance.
2. **Evaluating how different methods respond to missing modalities**: The paper evaluates two strategies (dynamic selection and a cross-attention mechanism) when a specific modality is missing.

The main research questions are:

- **RQ1**: Is dynamic selection of modalities and views a promising multimodal AI approach?
- **RQ2**: What is the impact on emotion recognition performance when the video or audio modality is missing?

Through these research questions, the paper aims to explore how to better recognize human emotions from incomplete real-world data. Experimental results on the RECOLA dataset show that, in the missing-modality scenarios, all dynamic selection-based methods outperformed the baseline, which suggests they adapt well when a modality is absent.
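
The summary does not describe the selection procedure itself, so the following is only a minimal sketch of the general dynamic selection idea, not the authors' method. It assumes a pool of per-modality predictors and picks, for each query, the one with the lowest error on the query's nearest validation samples; predictors whose modality is missing are skipped. All names (`Expert`, `select_expert`, `val_set`, the toy features and targets) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Modality name -> feature vector, or None when that modality is missing.
Features = Dict[str, Optional[Sequence[float]]]


@dataclass
class Expert:
    """A hypothetical predictor tied to a single modality."""
    name: str
    modality: str
    predict: Callable[[Sequence[float]], float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_expert(experts: List[Expert], query: Features,
                  val_set: List[Tuple[Features, float]], k: int = 3) -> Expert:
    """Pick the available expert with the lowest mean absolute error on the
    k validation samples closest to the query in that expert's modality."""
    best, best_err = None, float("inf")
    for expert in experts:
        x = query.get(expert.modality)
        if x is None:
            continue  # modality missing for this query: expert is unusable
        usable = [(f, y) for f, y in val_set if f.get(expert.modality) is not None]
        neighbours = sorted(
            usable, key=lambda s: sq_dist(s[0][expert.modality], x))[:k]
        if not neighbours:
            continue
        err = sum(abs(expert.predict(f[expert.modality]) - y)
                  for f, y in neighbours) / len(neighbours)
        if err < best_err:
            best, best_err = expert, err
    if best is None:
        raise ValueError("no expert is usable: every modality is missing")
    return best


# Toy validation data (made up): the target follows audio for the first two
# samples and video for the last two.
val_set = [
    ({"audio": [0.0], "video": [0.6]}, 0.0),
    ({"audio": [0.1], "video": [0.7]}, 0.1),
    ({"audio": [0.5], "video": [0.9]}, 0.9),
    ({"audio": [0.4], "video": [1.0]}, 1.0),
]
experts = [
    Expert("audio_expert", "audio", lambda x: x[0]),
    Expert("video_expert", "video", lambda x: x[0]),
]

if __name__ == "__main__":
    both = {"audio": [0.05], "video": [0.65]}
    no_audio = {"audio": None, "video": [0.65]}
    print(select_expert(experts, both, val_set, k=2).name)
    print(select_expert(experts, no_audio, val_set, k=2).name)
```

In this toy setup the selector falls back to the video predictor when audio is absent, even though the audio predictor would have been locally more accurate. That fallback is the kind of graceful degradation the research questions above are probing.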