Abstract: Multimodal emotion recognition (MER), leveraging speech and text, has emerged as a pivotal domain within human-computer interaction, demanding sophisticated methods for effective multimodal integration. The challenge of aligning features across these modalities is significant, with most existing approaches adopting a singular alignment strategy. Such a narrow focus not only limits model performance but also fails to address the complexity and ambiguity inherent in emotional expressions. In response, this paper introduces a Multi-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its comprehensive approach encompassing distribution-based, instance-based, and token-based alignment modules. This framework enables a multi-level perception of emotional information across modalities. Our experiments on IEMOCAP demonstrate that our proposed method outperforms current state-of-the-art techniques.

What problem does this paper attempt to address?

This paper addresses the feature alignment problem in multimodal emotion recognition (MER). Most existing MER methods adopt a single alignment strategy, which limits model performance and does not fully cope with the complexity and ambiguity of emotional expressions. To address this, the authors propose the Multi-Granularity Cross-Modal Alignment (MGCMA) framework, which aims for more comprehensive perception of emotional information through distribution-level, instance-level, and token-level alignment modules.

### Specific description of the problem

1. **Limitations of existing methods**:
   - **Single alignment strategy**: Most existing methods adopt only one alignment strategy, either fine-grained or coarse-grained, which limits model performance.
   - **Complexity and ambiguity of emotional expressions**: Emotional expressions are highly complex and ambiguous, and existing methods do not fully handle these characteristics, resulting in a decline in alignment quality.
2. **Objectives**:
   - Propose a framework that handles alignment at several granularities to improve the accuracy of multimodal emotion recognition.
   - Address the complexity and ambiguity of emotional expressions and improve the overall performance of the model.

### Solution

The MGCMA framework proposed by the authors contains three main modules:

1. **Distribution-based Alignment Module**:
   - Achieves coarse-grained alignment through distribution-level contrastive learning to deal with the ambiguity of emotional expressions.
   - Uses the multi-head self-attention mechanism to construct multivariate Gaussian distributions and calculates the 2-Wasserstein distance between the two distributions.
2. **Token-based Alignment Module**:
   - Achieves fine-grained alignment through self-attention and cross-attention mechanisms to promote local information exchange between the features of different modalities.
   - Consists of multiple blocks, each containing self-attention and cross-attention mechanisms.
3. **Instance-based Alignment Module**:
   - Achieves instance-level alignment through contrastive learning to strengthen the mapping between specific speech-text pairs.
   - Calculates an instance-level contrastive loss so that matched speech-text pairs lie closer together in the latent space and unmatched pairs lie farther apart.

### Experimental results

The authors conducted experiments on the IEMOCAP dataset. The results show that the MGCMA framework outperforms the current state-of-the-art methods in both weighted accuracy (WA) and unweighted accuracy (UA), reaching 78.87% and 80.24% respectively.

### Formula display

1. **Multi-head self-attention mechanism**:
   \[ \text{Attention}(Q, K, V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \]
   \[ \text{head}_i = \text{Attention}(Q_iW_q^i, K_iW_k^i, V_iW_v^i) \]
   \[ \text{Concat}=[\text{head}_1,\text{head}_2,\ldots,\text{head}_k]W_o \]
2. **2-Wasserstein distance**:
   \[ W(N_1, N_2)=\|\mu_1 - \mu_2\|_2^2+\text{Tr}(\
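
For reference, the squared 2-Wasserstein distance between two Gaussians has a standard closed form, the squared distance between the means plus Tr(Σ1 + Σ2 − 2(Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2}). The NumPy sketch below evaluates that textbook expression. It is not taken from the paper: the function names `psd_sqrt` and `wasserstein2_sq` are hypothetical, and treating the covariances as full matrices is an assumption, since the summary does not say whether the distribution-based module uses full or diagonal covariances.

```python
import numpy as np


def psd_sqrt(mat):
    """Square root of a symmetric positive semi-definite matrix (hypothetical helper)."""
    vals, vecs = np.linalg.eigh(mat)
    # Clip tiny negative eigenvalues that can appear through rounding error.
    vals = np.clip(vals, 0.0, None)
    # Scale each eigenvector column by sqrt(eigenvalue), then recombine.
    return (vecs * np.sqrt(vals)) @ vecs.T


def wasserstein2_sq(mu1, cov1, mu2, cov2):
    """Squared 2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    root1 = psd_sqrt(cov1)
    # (cov1^{1/2} cov2 cov1^{1/2})^{1/2}, the cross term of the closed form.
    cross = psd_sqrt(root1 @ cov2 @ root1)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.trace(cov1 + cov2 - 2.0 * cross)
    return float(mean_term + cov_term)


if __name__ == "__main__":
    # Toy stand-ins for a speech distribution and a text distribution.
    speech_mu, speech_cov = np.zeros(2), np.eye(2)
    text_mu, text_cov = np.array([3.0, 4.0]), np.eye(2)
    print(wasserstein2_sq(speech_mu, speech_cov, text_mu, text_cov))
```

When both covariances are equal, the trace term vanishes and the result reduces to the squared Euclidean distance between the means, so the distance only reflects differences in spread when the two covariances differ.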

Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

A Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

Multimodal emotion recognition based on audio and text by using hybrid attention networks

Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

A Multi-Level Alignment and Cross-Modal Unified Semantic Graph Refinement Network for Conversational Emotion Recognition

Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

Multiplex graph aggregation and feature refinement for unsupervised incomplete multimodal emotion recognition

Target and Source Modality Co-Reinforcement for Emotion Understanding from Asynchronous Multimodal Sequences.

Multimodal emotion recognition from facial expression and speech based on feature fusion

Multimodal Emotion Recognition Based on Facial Expressions, Speech, and Body Gestures

MultiEMO: an Attention-Based Correlation-Aware Multimodal Fusion Framework for Emotion Recognition in Conversations.

Research on cross-modal emotion recognition based on multi-layer semantic fusion

CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

cross-modal fusion techniques for utterance-level emotion recognition from text and speech

On the Performance of Blanking Nonlinearity in Real-Valued OFDM-Based PLC