Multi-modal Intent Detection with LVAMoE: the Language-Visual-Audio Mixture of Experts

Tingyu Li, Junpeng Bao, Jiaqi Qin, Yuping Liang, Ruijiang Zhang, Jason Wang
DOI: https://doi.org/10.1109/icme57554.2024.10688018
2024-01-01
Abstract: Multimodal Intent Detection is an important task for understanding human language in real-world multimodal scenarios, where the key is the representation and fusion of information from different modalities (e.g., language, visual, and audio). Most previous research focuses on exploring modality fusion methods to improve model performance. However, the heterogeneity gap between modality representations poses a major challenge in the fusion stage. To address this problem, the Language-Visual-Audio Mixture of Experts (LVAMoE) is proposed with the aim of minimizing the distribution gap between modalities in the representation stage. First, a dense encoder is used to obtain modality-invariant representations. Second, a sparse representation encoder with Mixture of Experts (MoE) is used to obtain modality-specific representations. Finally, multimodal interaction and fusion are achieved through a cross-modal attention approach combined with contrastive learning. To validate the model performance, experiments are conducted on three datasets. The results show that LVAMoE outperforms the baseline model on several evaluation metrics.
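To make the sparse Mixture of Experts step more concrete, the following is a minimal NumPy sketch of generic top-k gated routing. It is not the authors' implementation: the function name sparse_moe, the use of plain linear experts, the choice k=2, and all tensor shapes are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_moe(x, gate_w, expert_ws, k=2):
    """Generic top-k gated mixture of linear experts (illustrative only).

    x:         (n, d)        input features for one modality
    gate_w:    (d, E)        gating weights (hypothetical)
    expert_ws: (E, d, d_out) one linear map per expert (hypothetical)
    """
    logits = x @ gate_w                              # (n, E) gate scores
    topk = np.argsort(logits, axis=-1)[:, -k:]       # indices of the k best experts
    masked = np.full_like(logits, -np.inf)           # drop all other experts
    np.put_along_axis(
        masked, topk, np.take_along_axis(logits, topk, axis=-1), axis=-1
    )
    weights = softmax(masked, axis=-1)               # (n, E), zero off the top-k
    # For clarity every expert is evaluated densely; a real sparse layer
    # would only run the selected experts.
    expert_out = np.einsum("nd,edo->neo", x, expert_ws)   # (n, E, d_out)
    out = np.einsum("ne,neo->no", weights, expert_out)    # weighted combination
    return out, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gate_w = rng.normal(size=(8, 3))
expert_ws = rng.normal(size=(3, 8, 5))
out, weights = sparse_moe(x, gate_w, expert_ws, k=2)
```

In this sketch each input row is routed to only k of the E experts, and the gate weights over the selected experts sum to one. How LVAMoE defines its experts, gating, and per-modality routing is not specified in the abstract.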