Topic and Style-aware Transformer for Multimodal Emotion Recognition

Shuwen Qiu, Prateek Singhal, Nitesh Sekhar
DOI: https://doi.org/10.18653/v1/2023.findings-acl.130
Abstract: Understanding emotional expression in multimodal signals is key for machines to better understand human communication. While the language, visual, and acoustic modalities provide clues from different perspectives, the visual modality has been shown to contribute minimally to emotion recognition performance because of its high dimensionality. We therefore first leverage the strong multimodal backbone VATT to project the visual signal into a common space with the language and acoustic signals. On top of this backbone, we propose two content-oriented features, Topic and Speaking style, to address subjectivity issues. Experiments on the benchmark dataset MOSEI show that our model can outperform state-of-the-art (SOTA) results, effectively incorporate visual signals, and handle subjectivity issues by serving as content "normalization".
Computer Science