Abstract: Humans express their emotions in a variety of ways, which has inspired research on multimodal fusion-based emotion recognition that draws on different modalities so that their information complements one another. However, extracting deep emotional features from different modalities and fusing them remains a challenging task. It is essential to exploit the advantages of different extraction and fusion approaches to capture the emotional information contained within and across modalities. In this paper, we present a novel multimodal emotion recognition framework, multimodal emotion recognition based on cascaded multichannel and hierarchical fusion (CMC-HF), in which visual, speech, and text signals are used simultaneously as multimodal inputs. First, three cascaded deep learning channels perform feature extraction for the three modalities separately, which strengthens the extraction of deeper information within each modality and improves recognition performance. Second, an improved hierarchical fusion module is introduced to promote intermodality interactions among the three modalities and further improve recognition and classification accuracy. Finally, to validate the effectiveness of the CMC-HF model, experiments are conducted on two benchmark datasets, IEMOCAP and CMU-MOSI. Compared with existing state-of-the-art methods, the results show an increase of roughly 2% to 3.2% in accuracy over the four classes on the IEMOCAP dataset and an improvement of 0.9% to 2.5% in average class accuracy on the CMU-MOSI dataset. The ablation results indicate that the cascaded feature extraction method and the hierarchical fusion method both contribute significantly to multimodal emotion recognition, suggesting that the three modalities contain deeper intermodality and intramodality information interactions. Hence, the proposed model has better overall performance and achieves higher recognition efficiency and better robustness.
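The abstract does not spell out the internals of the hierarchical fusion module, so the following is only a minimal sketch of one common reading of "hierarchical fusion": per-modality feature vectors are first fused pairwise, and the pairwise results are then fused into a single trimodal representation that feeds a four-class classifier. Every name here (`init_layer`, `fuse`, `hierarchical_fusion`), the layer sizes, the tanh projection, and the random weights are assumptions made for illustration and are not taken from the CMC-HF paper. The input vectors stand in for the outputs of the three cascaded extraction channels.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Randomly initialised dense layer, returned as (weight rows, biases)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply a dense layer to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse(parts, layer):
    """Concatenate feature vectors, then project through a tanh dense layer."""
    joined = [v for part in parts for v in part]
    return [math.tanh(v) for v in linear(joined, layer)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_fusion(visual, speech, text, params):
    # Level 1 (assumed): pairwise fusion models interactions between two modalities.
    vs = fuse([visual, speech], params["visual_speech"])
    vt = fuse([visual, text], params["visual_text"])
    st = fuse([speech, text], params["speech_text"])
    # Level 2 (assumed): the three bimodal vectors are fused into one trimodal vector.
    tri = fuse([vs, vt, st], params["trimodal"])
    # Classifier head over four classes (the IEMOCAP setting in the abstract).
    return softmax(linear(tri, params["classifier"]))


rng = random.Random(0)
FEAT, HID, CLASSES = 8, 6, 4  # hypothetical sizes, chosen only for the demo
params = {
    "visual_speech": init_layer(rng, 2 * FEAT, HID),
    "visual_text": init_layer(rng, 2 * FEAT, HID),
    "speech_text": init_layer(rng, 2 * FEAT, HID),
    "trimodal": init_layer(rng, 3 * HID, HID),
    "classifier": init_layer(rng, HID, CLASSES),
}
# Placeholder per-modality features; real ones would come from the cascaded channels.
visual = [rng.gauss(0, 1) for _ in range(FEAT)]
speech = [rng.gauss(0, 1) for _ in range(FEAT)]
text = [rng.gauss(0, 1) for _ in range(FEAT)]

probs = hierarchical_fusion(visual, speech, text, params)
print(probs)  # four class probabilities that sum to 1
```

The weights are untrained, so the printed probabilities carry no meaning; the sketch only shows how information flows from unimodal to bimodal to trimodal representations.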
