Multi-level attention fusion network assisted by relative entropy alignment for multimodal speech emotion recognition

Jianjun Lei, Jing Wang, Ying Wang
DOI: https://doi.org/10.1007/s10489-024-05630-8
IF: 5.3
2024-06-27
Applied Intelligence
Abstract: Multimodal speech emotion recognition can exploit features from different modalities simultaneously to improve modeling capability in affective computing. However, coarse feature-combination methods may not effectively promote interaction or facilitate learning between modalities. This paper proposes a novel multimodal Speech Emotion Recognition (SER) framework that maps two feature vectors from different modalities into the same feature space through a Relative Entropy Alignment (REA) mechanism, which helps the modalities complement each other in learning emotional representations. Specifically, we employ the Kullback-Leibler Divergence (KLD) to perform the feature alignment, capturing the temporal correlation between audio and text while encouraging the features of both modalities to align in the feature space, thereby alleviating modality conflicts. In addition, we construct a Multi-level Attention Fusion (MAF) mechanism to capture emotion representations from the different modalities, facilitating information exchange between them while suppressing some redundant features. Furthermore, we extract multi-level acoustic information with the wavelet packet transform to further enrich the emotional features of the audio modality. Experimental results on multimodal emotion datasets, including Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Multi-modal Emotion Lines Dataset (MELD), demonstrate that the proposed method outperforms state-of-the-art models.
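The KLD-based alignment idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`softmax`, `kl_divergence`, `alignment_loss`), the softmax normalization of the features, and the direction of the divergence (audio relative to text) are all assumptions made for illustration only.

```python
import math

def softmax(xs):
    # Turn a raw feature vector into a probability distribution (assumed step).
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def alignment_loss(audio_feat, text_feat):
    # Hypothetical alignment term: divergence of the audio distribution
    # from the text distribution. The direction is an assumption.
    return kl_divergence(softmax(audio_feat), softmax(text_feat))

if __name__ == "__main__":
    audio = [0.2, 1.5, -0.3, 0.9]   # made-up audio feature vector
    text = [0.1, 1.2, 0.0, 1.1]     # made-up text feature vector
    print(alignment_loss(audio, text))
```

A loss of this kind is zero when the two normalized feature vectors coincide and grows as they diverge, so minimizing it during training would push the two modalities toward a shared feature space.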
computer science, artificial intelligence