Abstract: Humans express their emotions in a variety of ways, which inspires research on multimodal fusion-based emotion recognition that utilizes different modalities to achieve information complementation. However, extracting deep emotional features from different modalities and fusing them remains a challenging task. It is essential to exploit the advantages of different extraction and fusion approaches to capture the emotional information contained within and across modalities. In this paper, we present a novel multimodal emotion recognition framework based on cascaded multichannel and hierarchical fusion (CMC-HF), in which visual, speech, and text signals are used simultaneously as multimodal inputs. First, three cascaded channels based on deep learning perform feature extraction for the three modalities separately, to extract deeper information within each modality and improve recognition performance. Second, an improved hierarchical fusion module is introduced to promote intermodality interactions among the three modalities and further improve recognition and classification accuracy. Finally, to validate the effectiveness of the designed CMC-HF model, experiments are conducted on two benchmark datasets, IEMOCAP and CMU-MOSI. The results show that we achieve an increase of approximately 2% to 3.2% in accuracy for the four classes of the IEMOCAP dataset, as well as an improvement of 0.9% to 2.5% in the average class accuracy for the CMU-MOSI dataset, compared with existing state-of-the-art methods. The ablation results indicate that the cascaded feature extraction method and the hierarchical fusion method both contribute significantly to multimodal emotion recognition, suggesting that the three modalities contain deeper intermodality and intramodality information interactions. Hence, the proposed model has better overall performance and achieves higher recognition efficiency and better robustness.

Cross-modal Features Interaction-and-Aggregation Network with Self-consistency Training for Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Multimodal Cross- and Self-Attention Network for Speech Emotion Recognition

A bimodal network based on Audio-Text-Interactional-Attention with ArcFace loss for speech emotion recognition

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

GCF2-Net: global-aware cross-modal feature fusion network for speech emotion recognition

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Head Fusion: Improving the Accuracy and Robustness of Speech Emotion Recognition on the IEMOCAP and RAVDESS Dataset

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Learning multi-scale features for speech emotion recognition with connection attention mechanism

MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

Speech Emotion Recognition Method Based on Cross-Layer Intersectant Fusion

Speaker-aware Cross-modal Fusion Architecture for Conversational Emotion Recognition

Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation

Improve Accuracy of Speech Emotion Recognition with Attention Head Fusion

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

Multimodal emotion recognition from facial expression and speech based on feature fusion

Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition

Multi-level attention fusion network assisted by relative entropy alignment for multimodal speech emotion recognition