Abstract: Humans express their emotions in a variety of ways, which has inspired research on multimodal fusion-based emotion recognition, where different modalities are combined so that their information complements one another. However, extracting deep emotional features from different modalities and fusing them remains a challenging task. It is essential to exploit the advantages of different extraction and fusion approaches to capture the emotional information contained within and across modalities. In this paper, we present a novel framework, multimodal emotion recognition based on cascaded multichannel and hierarchical fusion (CMC-HF), in which visual, speech, and text signals are used simultaneously as multimodal inputs. First, three cascaded deep learning channels perform feature extraction for the three modalities separately, strengthening the extraction of deeper information within each modality and improving recognition performance. Second, an improved hierarchical fusion module is introduced to promote intermodality interactions among the three modalities and further improve recognition and classification accuracy. Finally, to validate the effectiveness of the CMC-HF model, we evaluate it on two benchmark datasets, IEMOCAP and CMU-MOSI. Compared with existing state-of-the-art methods, the results show an increase of approximately 2% to 3.2% in four-class accuracy on the IEMOCAP dataset and an improvement of 0.9% to 2.5% in average class accuracy on the CMU-MOSI dataset. The ablation results indicate that both the cascaded feature extraction method and the hierarchical fusion method contribute significantly to multimodal emotion recognition, suggesting that the three modalities contain deeper intermodality and intramodality information interactions. Hence, the proposed model has better overall performance and achieves higher recognition efficiency and better robustness.
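
To make the idea of hierarchical fusion concrete, the following is a minimal sketch and not the CMC-HF implementation. The abstract does not specify the fusion operators, so everything here is an assumption made for illustration: each modality is taken to be already encoded as a fixed-length feature vector, the hypothetical `fuse_pair` function uses a simple sigmoid gate, and the hypothetical `hierarchical_fusion` function first fuses every pair of modalities and then concatenates the pairwise results into one trimodal representation.

```python
from itertools import combinations
from math import exp


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + exp(-x))


def fuse_pair(a: list[float], b: list[float]) -> list[float]:
    """Hypothetical bimodal fusion (an assumption, not from the paper).

    Each output dimension is a gated mix of the two inputs, where the
    gate is sigmoid(a_i * b_i).
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for x, y in zip(a, b):
        gate = sigmoid(x * y)
        fused.append(gate * x + (1.0 - gate) * y)
    return fused


def hierarchical_fusion(features: dict[str, list[float]]) -> list[float]:
    """Two-stage fusion sketch over per-modality feature vectors.

    Stage 1 fuses every pair of modalities (pairs are taken in sorted
    name order). Stage 2 concatenates the pairwise vectors.
    """
    pair_vectors = [
        fuse_pair(features[m1], features[m2])
        for m1, m2 in combinations(sorted(features), 2)
    ]
    trimodal: list[float] = []
    for vec in pair_vectors:
        trimodal.extend(vec)
    return trimodal
```

Calling `hierarchical_fusion({"text": [...], "speech": [...], "visual": [...]})` with three vectors of length d returns a vector of length 3d, which a downstream classifier could consume. In a trained model the gate and the second-stage combination would be learned layers rather than fixed functions.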

First-order Multi-label Learning with Cross-modal Interactions for Multimodal Emotion Recognition

A Versatile Multimodal Learning Framework For Zero-shot Emotion Recognition

Leveraging Label Information for Multimodal Emotion Recognition

CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation

Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations

Multi-Label Multimodal Emotion Recognition With Transformer-Based Fusion and Emotion-Level Representation Learning

Multimodal emotion recognition based on audio and text by using hybrid attention networks

Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition

Cross-modal fusion techniques for utterance-level emotion recognition from text and speech

Multiplex graph aggregation and feature refinement for unsupervised incomplete multimodal emotion recognition

MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning

An Improved Multimodal Dimension Emotion Recognition Based on Different Fusion Methods

Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

Research on cross-modal emotion recognition based on multi-layer semantic fusion

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023