Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment

Xuechen Wang, Shiwan Zhao, Haoqin Sun, Hui Wang, Jiaming Zhou, Yong Qin
2024-12-30
Abstract: Multimodal emotion recognition (MER), leveraging speech and text, has emerged as a pivotal domain within human-computer interaction, demanding sophisticated methods for effective multimodal integration. The challenge of aligning features across these modalities is significant, with most existing approaches adopting a singular alignment strategy. Such a narrow focus not only limits model performance but also fails to address the complexity and ambiguity inherent in emotional expressions. In response, this paper introduces a Multi-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its comprehensive approach encompassing distribution-based, instance-based, and token-based alignment modules. This framework enables a multi-level perception of emotional information across modalities. Our experiments on IEMOCAP demonstrate that our proposed method outperforms current state-of-the-art techniques.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
This paper addresses the feature alignment problem in multimodal emotion recognition (MER). Most existing MER methods adopt a single alignment strategy, which limits model performance and fails to cope with the complexity and ambiguity of emotional expressions. To address this, the authors propose the Multi-Granularity Cross-Modal Alignment (MGCMA) framework, which aims for more comprehensive perception of emotional information through distribution-level, instance-level, and token-level alignment modules.

### Specific description of the problem

1. **Limitations of existing methods**:
   - **Single alignment strategy**: Most existing methods adopt only one alignment strategy, either fine-grained or coarse-grained, which limits model performance.
   - **Complexity and ambiguity of emotional expressions**: Emotional expressions are highly complex and ambiguous. Existing methods do not fully handle these characteristics, which degrades alignment quality.
2. **Objectives**:
   - Propose a framework that handles alignment at several granularities to improve the accuracy of multimodal emotion recognition.
   - Address the complexity and ambiguity of emotional expressions and improve the overall performance of the model.

### Solution

The MGCMA framework proposed by the authors contains three main modules:

1. **Distribution-based Alignment Module**:
   - Achieves coarse-grained alignment through distribution-level contrastive learning, to deal with the ambiguity of emotional expressions.
   - Uses multi-head self-attention to construct a multivariate Gaussian distribution for each modality, then computes the 2-Wasserstein distance between the two distributions.
2. **Token-based Alignment Module**:
   - Achieves fine-grained alignment through self-attention and cross-attention, promoting local information exchange between the features of the two modalities.
   - Consists of multiple blocks, each containing a self-attention and a cross-attention mechanism.
3. **Instance-based Alignment Module**:
   - Achieves instance-level alignment through contrastive learning, strengthening the mapping between specific speech-text pairs.
   - Computes an instance-level contrastive loss so that matched speech-text pairs lie closer together in the latent space and unmatched pairs lie farther apart.

### Experimental results

The authors conducted experiments on the IEMOCAP dataset. The results show that the MGCMA framework outperforms current state-of-the-art methods in both weighted accuracy (WA) and unweighted accuracy (UA), reaching 78.87% and 80.24% respectively.

### Formula display

1. **Multi-head self-attention mechanism**:
   \[ \text{Attention}(Q, K, V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \]
   \[ \text{head}_i = \text{Attention}(Q_iW_q^i, K_iW_k^i, V_iW_v^i) \]
   \[ \text{Concat}=[\text{head}_1,\text{head}_2,\ldots,\text{head}_k]W_o \]
2. **2-Wasserstein distance**:
   \[ W(N_1, N_2)=||\mu_1 - \mu_2||_2^2+\text{Tr}(\