JTMA: Joint Multimodal Feature Fusion and Temporal Multi-head Attention for Humor Detection

Qi Li, Yangyang Xu, Zhuoer Zhao, Shulei Tang, Feixiang Zhang, Ruotong Wang, Xiao Sun, Meng Wang
DOI: https://doi.org/10.1145/3606039.3613112
2023-10-29
Abstract: In this paper, we propose a model named Joint multimodal feature fusion and Temporal Multi-head Attention (JTMA) for the MuSe-Humor sub-challenge of the Multimodal Sentiment Analysis Challenge 2023. The goal of the MuSe-Humor sub-challenge is to predict whether humor occurs in given data that spans multiple modalities (e.g., text, audio and video). Cross-cultural testing presents a new challenge that distinguishes this edition from previous years. To address this, JTMA first uses a 1-D CNN to aggregate temporal information within each unimodal feature. The multimodal feature encoder module then models inter-modality and intra-modality interactions. Finally, the high-level representations learned from the multiple modalities are integrated to predict humor. Experimental results on the official test set demonstrate the effectiveness of the proposed model: it achieves an AUC score of 0.8889, surpassing all other participants in the competition and securing the Top 1 ranking.
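The three stages named in the abstract (temporal aggregation with a 1-D convolution, interaction across fused features, and a pooled prediction) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the fixed smoothing kernel, the concatenation-based fusion, the single attention head with identity projections, and the mean-pooled sigmoid output are all assumptions made for illustration, and nothing here is learned.

```python
import math

def conv1d_same(seq, kernel):
    """Slide a 1-D kernel over time (zero padded) for each feature dimension.
    seq is a list of per-time-step feature vectors. Hypothetical helper."""
    half = len(kernel) // 2
    dim = len(seq[0])
    out = []
    for t in range(len(seq)):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            s = t + k - half  # source time step covered by this kernel tap
            if 0 <= s < len(seq):
                for d in range(dim):
                    acc[d] += w * seq[s][d]
        out.append(acc)
    return out

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Single-head scaled dot-product self-attention over time.
    Assumption: identity query/key/value projections, so no learned weights."""
    scale = math.sqrt(len(seq[0]))
    out = []
    for q in seq:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in seq])
        out.append([sum(w * v[d] for w, v in zip(weights, seq))
                    for d in range(len(q))])
    return out

def predict_humor(modalities, kernel=(0.25, 0.5, 0.25), bias=0.0):
    """Toy pipeline: per-modality temporal conv, fusion, attention, pooling."""
    # 1) Aggregate temporal context within each modality.
    smoothed = [conv1d_same(m, kernel) for m in modalities]
    # 2) Fuse by concatenating the modality features at each time step.
    fused = [sum((m[t] for m in smoothed), [])
             for t in range(len(smoothed[0]))]
    # 3) Let time steps attend to each other over the fused features.
    attended = self_attention(fused)
    # 4) Mean-pool over time and features, then squash to a probability.
    pooled = [sum(col) / len(attended) for col in zip(*attended)]
    logit = sum(pooled) / len(pooled) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical inputs: 3 time steps, 2-dim text features, 1-dim audio features.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5], [0.5], [0.5]]
score = predict_humor([text, audio])
```

Because the dot products in step 3 span the concatenated vector, the attention weights depend on every modality at once, which is only a crude stand-in for the inter-modality and intra-modality interactions the abstract describes. A trained model would replace the fixed kernel and identity projections with learned parameters and use multiple heads.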
Computer Science