AuxFormer: Robust Approach to Audiovisual Emotion Recognition

Lucas Goncalves, Carlos Busso
DOI: https://doi.org/10.1109/icassp43922.2022.9747157
2022-05-23
Abstract: A challenging task in audiovisual emotion recognition is to implement neural network architectures that can leverage and fuse multimodal information while temporally aligning modalities, handling missing modalities, and capturing information from all modalities without losing information during training. These requirements are important to achieve model robustness and to increase accuracy on the emotion recognition task. A recent approach to multimodal fusion is to use the transformer architecture to fuse and align the modalities. This study proposes the AuxFormer framework, which addresses the aforementioned challenges in a principled way. AuxFormer combines the transformer framework with auxiliary networks. It uses shared losses to infuse information from single-modality networks that are separately embedded. The extra layer of audiovisual information added to our main network retains information that would otherwise be lost during training. The results show that the AuxFormer architecture achieves macro and micro F1-scores of 71.3% and 71.7%, respectively, on the CREMA-D corpus. For the MSP-IMPROV corpus, AuxFormer achieves macro and micro F1-scores of 70.4% and 76.5%, respectively. The results for both corpora are significantly better than strong baselines, indicating that our framework benefits from auxiliary networks. We also show that under non-ideal conditions (e.g., missing modalities) our architecture is able to sustain strong performance in audio-only and video-only scenarios, benefiting from an optimized training strategy.
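The abstract reports both macro and micro F1-scores, which can diverge when emotion classes are imbalanced. The following is a minimal sketch of how the two averages are commonly computed for single-label classification. It is not the authors' evaluation code; the function name `f1_scores` and the toy labels are assumptions made purely for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy, made-up labels: a model that always predicts the majority class.
y_true = ["happy", "happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "happy", "happy"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"macro={macro_f1:.3f} micro={micro_f1:.3f}")
```

In this toy case the micro score stays high because it is dominated by the frequent class, while the macro score drops because the rare class is never predicted. Reporting both, as the abstract does, gives a view of performance that is less sensitive to class imbalance.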