Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition

Qifei Li, Yingming Gao, Yuhua Wen, Cong Wang, Ya Li
2024-08-18
Abstract: To address the limitation in multimodal emotion recognition (MER) performance arising from inter-modal information fusion, we propose a novel MER framework based on multitask learning where fusion occurs after alignment, called Foal-Net. The framework is designed to enhance the effectiveness of modality fusion and includes two auxiliary tasks: audio-video emotion alignment (AVEL) and cross-modal emotion label matching (MEM). First, AVEL achieves alignment of emotional information in audio-video representations through contrastive learning. Then, a modal fusion network integrates the aligned features. Meanwhile, MEM assesses whether the emotions of the current sample pair are the same, providing assistance for modal information fusion and guiding the model to focus more on emotional information. Experimental results on the IEMOCAP corpus show that Foal-Net outperforms state-of-the-art methods and that emotion alignment is necessary before modal fusion.
Sound, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to address the issue of modality fusion in Multimodal Emotion Recognition (MER). Specifically, the authors propose a new framework, Foal-Net, which is based on multi-task learning and improves modality fusion through two auxiliary tasks:

1. **Audio-Video Emotion Alignment (AVEL)**: Contrastive learning aligns emotional information between audio and video representations, increasing the similarity between sample pairs with the same emotional category and decreasing the similarity between sample pairs with different emotional categories.
2. **Cross-Modal Emotion Label Matching (MEM)**: A binary classification task determines whether the emotional labels of the current input sample pair are consistent, thereby promoting the fusion of modality information and guiding the model to focus more on emotional information.

Experimental results show that Foal-Net outperforms existing state-of-the-art methods on the IEMOCAP dataset, demonstrating the importance of emotion alignment before modality fusion and the effectiveness of the auxiliary tasks. Additionally, the paper verifies the superiority of CLIP embeddings in emotion recognition tasks.
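
The two auxiliary tasks described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`match_targets`, `alignment_loss`), the temperature value, and the exact loss form (a label-aware, InfoNCE-style loss in the audio-to-video direction only) are assumptions chosen for illustration. The summary only states that same-emotion pairs should become more similar, different-emotion pairs less similar, and that a binary task checks whether a pair's labels match.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def match_targets(labels):
    """MEM-style binary targets (hypothetical helper).

    Entry [i][j] is 1 if samples i and j share an emotion label, else 0.
    """
    return [[int(yi == yj) for yj in labels] for yi in labels]


def alignment_loss(audio, video, labels, temperature=0.1):
    """Label-aware contrastive loss, audio-to-video direction (assumed form).

    For each audio embedding, video embeddings with the same emotion label
    are treated as positives and all others as negatives.
    """
    targets = match_targets(labels)
    total = 0.0
    for i, a in enumerate(audio):
        logits = [cosine(a, v) / temperature for v in video]
        # Numerically stable log-sum-exp over all video candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Sample i always matches itself, so this list is never empty.
        positives = [j for j, t in enumerate(targets[i]) if t == 1]
        total -= sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / len(audio)


# Toy batch: two "happy" samples and one "sad" sample with 2-D embeddings.
labels = ["happy", "sad", "happy"]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
aligned_video = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
misaligned_video = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

print(match_targets(labels))
print(alignment_loss(audio, aligned_video, labels))
print(alignment_loss(audio, misaligned_video, labels))
```

In this toy batch the loss is lower when same-emotion audio and video embeddings point in the same direction, which is the behaviour the alignment task is meant to encourage. A symmetric variant would average this loss with its video-to-audio counterpart, and in a trained model the matching targets would supervise a binary classifier on the fused features rather than being used directly.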