Robust Audiovisual Emotion Recognition: Aligning Modalities, Capturing Temporal Information, and Handling Missing Features

Lucas Goncalves, Carlos Busso
DOI: https://doi.org/10.1109/taffc.2022.3216993
IF: 13.99
2022-11-29
IEEE Transactions on Affective Computing
Abstract: Emotion recognition using audiovisual features is a challenging task for human-machine interaction systems. Under ideal conditions (perfect illumination, clean speech signals, and non-occluded visual data), many systems are able to achieve reliable results. However, few studies have considered developing multimodal systems and training strategies that perform well under non-ideal conditions. Audiovisual models still face challenging problems such as misalignment of modalities, lack of temporal modeling, and missing features due to noise or occlusions. In this article, we implement a model that combines auxiliary networks, a transformer architecture, and an optimized training mechanism to achieve a robust system for audiovisual emotion recognition that addresses these challenges in a principled way. Our evaluation analyzes how well this model performs in ideal conditions and when modalities are missing. We contrast this method with other multimodal fusion methods for emotion recognition. Our experimental results based on two audiovisual databases demonstrate that the proposed framework achieves: 1) improvements in emotion recognition accuracy, 2) better alignment and fusion of audiovisual features at the model level, 3) awareness of temporal information, and 4) robustness to non-ideal scenarios.
computer science, cybernetics, artificial intelligence