Abstract: Humans express their emotions in a variety of ways, which has inspired research on multimodal fusion-based emotion recognition, where different modalities are combined so that their information complements one another. However, extracting deep emotional features from different modalities and fusing them remains a challenging task. It is essential to exploit the advantages of different extraction and fusion approaches to capture the emotional information contained within and across modalities. In this paper, we present a novel multimodal emotion recognition framework based on cascaded multichannel and hierarchical fusion (CMC-HF), in which visual, speech, and text signals are used simultaneously as multimodal inputs. First, three cascaded channels based on deep learning perform feature extraction for the three modalities separately, enhancing the extraction of deeper information within each modality and improving recognition performance. Second, an improved hierarchical fusion module is introduced to promote intermodality interactions among the three modalities and further improve recognition and classification accuracy. Finally, to validate the effectiveness of the CMC-HF model, experiments are conducted on two benchmark datasets, IEMOCAP and CMU-MOSI. The results show an increase of approximately 2% to 3.2% in accuracy on the four classes of the IEMOCAP dataset and an improvement of 0.9% to 2.5% in average class accuracy on the CMU-MOSI dataset compared with existing state-of-the-art methods. The ablation results indicate that both the cascaded feature extraction method and the hierarchical fusion method contribute significantly to multimodal emotion recognition, suggesting that the three modalities contain deeper intermodality and intramodality information interactions. Hence, the proposed model has better overall performance and achieves higher recognition efficiency and better robustness.
