Abstract: Humans express their emotions in a variety of ways, which motivates research on multimodal fusion-based emotion recognition, where different modalities supply complementary information. However, extracting deep emotional features from different modalities and fusing them remains a challenging task. It is essential to exploit the advantages of different extraction and fusion approaches to capture the emotional information contained within and across modalities. In this paper, we present a novel multimodal emotion recognition framework based on cascaded multichannel and hierarchical fusion (CMC-HF), in which visual, speech, and text signals are used simultaneously as multimodal inputs. First, three cascaded deep learning channels perform feature extraction for the three modalities separately, strengthening the extraction of deeper information within each modality and improving recognition performance. Second, an improved hierarchical fusion module is introduced to promote intermodality interactions among the three modalities and further improve recognition and classification accuracy. Finally, to validate the effectiveness of the designed CMC-HF model, experiments are conducted on two benchmark datasets, IEMOCAP and CMU-MOSI. Compared with existing state-of-the-art methods, the results show an increase of approximately 2% to 3.2% in four-class accuracy on the IEMOCAP dataset and an improvement of 0.9% to 2.5% in average class accuracy on the CMU-MOSI dataset. The ablation results indicate that both the cascaded feature extraction method and the hierarchical fusion method contribute significantly to multimodal emotion recognition, suggesting that the three modalities carry deeper information interactions both across modalities (intermodality) and within each modality (intramodality). Hence, the proposed model has better overall performance and achieves higher recognition efficiency and better robustness.
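
To make the idea of hierarchical fusion concrete, the following is a minimal sketch in plain Python. It is not the CMC-HF implementation: the abstract does not specify the fusion operators, so this sketch assumes concatenation followed by a dense tanh layer at each level. The names `make_params`, `hierarchical_fusion`, the feature dimension, and the hidden size are all hypothetical. The per-modality feature vectors are assumed to come from the three extraction channels, which are not modeled here.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(x, weights, bias):
    """Compute W x + b for a plain Python vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def dense(x, layer):
    """Affine map followed by a tanh nonlinearity."""
    weights, bias = layer
    return [math.tanh(z) for z in affine(x, weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_params(dim, hidden, n_classes, seed=0):
    """Hypothetical parameters: three bimodal layers, one trimodal layer, one classifier."""
    rng = random.Random(seed)
    return {
        "vs": init_layer(2 * dim, hidden, rng),      # visual + speech
        "vt": init_layer(2 * dim, hidden, rng),      # visual + text
        "st": init_layer(2 * dim, hidden, rng),      # speech + text
        "tri": init_layer(3 * hidden, hidden, rng),  # all three bimodal outputs
        "cls": init_layer(hidden, n_classes, rng),   # classifier head
    }


def hierarchical_fusion(visual, speech, text, params):
    """Fuse three per-modality feature vectors in two levels, then classify."""
    # Level 1 (assumed): fuse each pair of modalities; list + list concatenates.
    vs = dense(visual + speech, params["vs"])
    vt = dense(visual + text, params["vt"])
    st = dense(speech + text, params["st"])
    # Level 2 (assumed): fuse the three bimodal representations into one vector.
    fused = dense(vs + vt + st, params["tri"])
    # Classifier head: map the fused vector to class probabilities.
    weights, bias = params["cls"]
    return softmax(affine(fused, weights, bias))


if __name__ == "__main__":
    rng = random.Random(1)
    dim = 8  # hypothetical per-modality feature size
    visual = [rng.gauss(0, 1) for _ in range(dim)]
    speech = [rng.gauss(0, 1) for _ in range(dim)]
    text = [rng.gauss(0, 1) for _ in range(dim)]
    params = make_params(dim=dim, hidden=6, n_classes=4)
    print(hierarchical_fusion(visual, speech, text, params))
```

The weights here are random and untrained, so the printed probabilities carry no meaning. The sketch only shows the data flow: pairwise interactions are formed first and then combined, rather than concatenating all three modalities in a single step.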

Feature Fusion for Multimodal Emotion Recognition Based on Deep Canonical Correlation Analysis

Emotion Recognition in Videos via Fusing Multimodal Features

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

Multimodal Emotion Recognition Using Deep Generalized Canonical Correlation Analysis with an Attention Mechanism

Multimodal Emotion Recognition Using Deep Canonical Correlation Analysis

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

Multimodal emotion recognition from facial expression and speech based on feature fusion

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

MMGCN: Multimodal Fusion Via Deep Graph Convolution Network for Emotion Recognition in Conversation

Multi-Modal Emotion Recognition by Fusing Correlation Features of Speech-Visual

Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition

Multimodal emotion recognition with capsule graph convolutional based representation fusion

K-Means Clustering-based Kernel Canonical Correlation Analysis for Multimodal Emotion Recognition

A Dual Attention-based Modality-Collaborative Fusion Network for Emotion Recognition

Multimodal Emotion Recognition Based on Feature Selection and Extreme Learning Machine in Video Clips

Comparing Recognition Performance and Robustness of Multimodal Deep Learning Models for Multimodal Emotion Recognition

A Cross-Modal Fusion Network Based on Self-Attention and Residual Structure for Multimodal Emotion Recognition

Deep Fusion of Multi-Channel Neurophysiological Signal for Emotion Recognition and Monitoring

A multi-stage dynamical fusion network for multimodal emotion recognition

Fusion with Hierarchical Graphs for Multimodal Emotion Recognition