Integrating gating and learned queries in audiovisual emotion recognition

Zaifang Zhang, Qing Guo, Shunlu Lu, Junyi Su, Tao Tang
DOI: https://doi.org/10.1007/s00530-024-01551-1
IF: 3.9
2024-12-04
Multimedia Systems
Abstract: Emotion recognition, an important bridge in human-computer interaction, has attracted significant interest. Although numerous studies have made progress in exploiting auditory and visual information, integrating the two modalities effectively remains a significant challenge. This paper proposes an audiovisual emotion recognition model that improves cross-modal fusion by introducing a private enhancement module (PEM) and a shared learned module (SLM). In the PEM, a token-level gating mechanism dynamically adjusts feature expression within tokens; the SLM employs learned queries to capture differences between modalities, yielding more precise cross-modal fusion. The model was evaluated on the CREMA-D and IEMOCAP datasets, where it showed better recognition performance than existing advanced emotion recognition models. Finally, ablation studies examine the roles and contributions of the individual modules within the model.
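The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the sketch assumes that "token-level gating" means an element-wise sigmoid gate computed from each token, and that "learned queries" means a small set of trainable vectors that attend over the concatenated audio and video tokens with single-head scaled dot-product attention. The function names (token_gate, learned_query_fusion), the weight shapes, and the random initial values are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def token_gate(tokens, weights, biases):
    """Assumed gating: out = sigmoid(W x + b) * x, applied to each token x.

    tokens: list of length-d vectors; weights: d x d matrix (list of rows);
    biases: length-d vector. Shapes are illustrative, not from the paper.
    """
    gated = []
    for x in tokens:
        gate = [sigmoid(dot(row, x) + b) for row, b in zip(weights, biases)]
        gated.append([g * xi for g, xi in zip(gate, x)])
    return gated


def learned_query_fusion(queries, audio_tokens, video_tokens):
    """Assumed fusion: each learned query attends over all audio and video
    tokens (single head, no key/value projections) and returns one fused
    vector per query."""
    memory = audio_tokens + video_tokens
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        attn = softmax([dot(q, k) / scale for k in memory])
        fused.append([sum(a * v[j] for a, v in zip(attn, memory))
                      for j in range(len(q))])
    return fused


# Toy usage with random values standing in for trained parameters.
random.seed(0)
d = 4
rand_vec = lambda: [random.uniform(-1.0, 1.0) for _ in range(d)]
audio = [rand_vec() for _ in range(3)]    # 3 audio tokens
video = [rand_vec() for _ in range(5)]    # 5 video tokens
W = [rand_vec() for _ in range(d)]        # hypothetical gate weights
b = [0.0] * d                             # hypothetical gate biases
queries = [rand_vec() for _ in range(2)]  # 2 hypothetical learned queries

fused = learned_query_fusion(queries, token_gate(audio, W, b),
                             token_gate(video, W, b))
print(len(fused), len(fused[0]))  # 2 4
```

In this sketch the number of fused vectors depends only on the number of queries, not on how many audio or video tokens are supplied, which is one common motivation for query-based fusion; whether the paper relies on this property is not stated in the abstract.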
computer science, information systems, theory & methods