An End-to-End Transformer with Progressive Tri-Modal Attention for Multi-modal Emotion Recognition.

Yang Wu, Pai Peng, Zhenyu Zhang, Yanyan Zhao, Bing Qin
DOI: https://doi.org/10.1007/978-981-99-8540-1_32
2024-01-01
Abstract: Recent work on multi-modal emotion recognition has moved towards end-to-end models, which, unlike the two-phase pipeline, can extract task-specific features under the supervision of the target task. In this paper, we propose a novel multi-modal end-to-end transformer for emotion recognition (ME2ET), which can effectively model the tri-modal feature interactions among the textual, acoustic, and visual modalities at both the low level and the high level. At the low level, we propose progressive tri-modal attention, which models the tri-modal feature interactions with a two-pass strategy and further leverages these interactions to significantly reduce computation and memory complexity by reducing the input token length. At the high level, we introduce a tri-modal feature fusion layer that explicitly aggregates the semantic representations of the three modalities. Experimental results on the CMU-MOSEI and IEMOCAP datasets show that ME2ET achieves state-of-the-art performance. Further in-depth analysis demonstrates the effectiveness, efficiency, and interpretability of the proposed tri-modal attention, which helps our model achieve better performance while significantly reducing computation and memory cost. Our code is available at https://github.com/SCIR-MSA-Team/UFMAC.
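The abstract does not spell out how the two-pass strategy shortens the input, so the following is only a minimal sketch of the general idea of attention-guided token reduction, not the authors' implementation. It assumes a single pooled text vector is used as the query, that the second pass re-scores each non-text modality with a query enriched by the other modality's first-pass summary, and that the highest-scoring tokens are kept. The names `attend`, `top_k_in_order`, `two_pass_reduce`, and the parameter `keep` are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, tokens):
    # Scaled dot-product attention of one query vector over a token list.
    # Returns the attention weights and the weighted summary vector.
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * t for q, t in zip(query, tok)) / scale
                       for tok in tokens])
    summary = [sum(w * tok[d] for w, tok in zip(weights, tokens))
               for d in range(len(query))]
    return weights, summary

def top_k_in_order(tokens, weights, keep):
    # Keep the `keep` highest-weighted tokens, preserving original order.
    ranked = sorted(range(len(tokens)), key=lambda i: weights[i],
                    reverse=True)[:keep]
    return [tokens[i] for i in sorted(ranked)]

def two_pass_reduce(text_query, acoustic, visual, keep):
    # Pass 1 (assumed): the text query attends to each modality separately.
    _, a_summary = attend(text_query, acoustic)
    _, v_summary = attend(text_query, visual)
    # Pass 2 (assumed): re-score each modality with a query that also
    # carries the other modality's summary, so all three modalities interact.
    a_query = [t + v for t, v in zip(text_query, v_summary)]
    v_query = [t + a for t, a in zip(text_query, a_summary)]
    a_weights, _ = attend(a_query, acoustic)
    v_weights, _ = attend(v_query, visual)
    # Shorten both sequences by keeping only the top-scoring tokens.
    return (top_k_in_order(acoustic, a_weights, keep),
            top_k_in_order(visual, v_weights, keep))

# Toy 2-dimensional features, chosen only for illustration.
text_query = [1.0, 0.0]
acoustic = [[0.1, 0.0], [2.0, 0.0], [0.0, 1.0], [1.5, 0.0]]
visual = [[0.0, 2.0], [1.0, 0.0], [3.0, 0.0]]
kept_a, kept_v = two_pass_reduce(text_query, acoustic, visual, keep=2)
print(len(kept_a), len(kept_v))  # 2 2
```

In this toy setting the acoustic sequence shrinks from four tokens to two and the visual sequence from three to two. Since self-attention cost grows quadratically with sequence length, shortening the inputs in this way is one route to the computation and memory savings the abstract describes.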