Abstract: Goal: As an essential task in human-machine interaction, emotion recognition has grown into an active research area over recent decades. Although previous attempts to classify emotions have achieved high performance, several challenges remain open: 1) how to effectively recognize emotions from different modalities, and 2) given the increasing computing power required for deep learning, how to provide real-time detection and improve the robustness of deep neural networks. Method: In this paper, we propose a deep learning-based multimodal emotion recognition (MER) framework called Deep-Emotion, which adaptively integrates the most discriminative features from facial expressions, speech, and electroencephalogram (EEG) signals to improve MER performance. Specifically, the proposed Deep-Emotion framework consists of three branches: the facial branch, the speech branch, and the EEG branch. The facial branch uses the improved GhostNet neural network proposed in this paper for feature extraction, which effectively alleviates overfitting during training and improves classification accuracy compared with the original GhostNet network. For the speech branch, we propose a lightweight fully convolutional neural network (LFCNN) for the efficient extraction of speech emotion features. For the EEG branch, we propose a tree-like LSTM (tLSTM) model capable of fusing multi-stage features for EEG emotion feature extraction. Finally, we adopt a decision-level fusion strategy to integrate the recognition results of the three modalities, resulting in more comprehensive and accurate performance. Results and Conclusions: Extensive experiments on the CK+, EMO-DB, and MAHNOB-HCI datasets demonstrate the strong performance of the proposed Deep-Emotion method, as well as the feasibility and superiority of the MER approach.
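To make the decision-level fusion step concrete, the following is a minimal sketch of one common way to combine per-branch outputs: a weighted average of the class probabilities produced by each branch. The abstract does not state which fusion rule Deep-Emotion uses, so the weighted-average rule, the function names `fuse_decisions` and `predict`, the branch keys, and the example probabilities are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def fuse_decisions(
    branch_probs: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of per-branch class probabilities.

    Hypothetical fusion rule: the abstract only says "decision-level
    fusion", not how the branch decisions are combined.
    """
    if not branch_probs:
        raise ValueError("at least one branch is required")
    if weights is None:
        # Assumption: equal trust in every branch by default.
        weights = {name: 1.0 for name in branch_probs}

    total = sum(weights[name] for name in branch_probs)
    n_classes = len(next(iter(branch_probs.values())))
    fused = [0.0] * n_classes
    for name, probs in branch_probs.items():
        if len(probs) != n_classes:
            raise ValueError(f"branch {name!r} has a different class count")
        w = weights[name] / total  # normalize so fused scores sum to 1
        for i, p in enumerate(probs):
            fused[i] += w * p
    return fused


def predict(fused: List[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Illustrative (made-up) softmax outputs for three emotion classes.
example = {
    "face": [0.6, 0.3, 0.1],
    "speech": [0.2, 0.5, 0.3],
    "eeg": [0.3, 0.4, 0.3],
}
equal = fuse_decisions(example)
face_heavy = fuse_decisions(example, {"face": 3.0, "speech": 1.0, "eeg": 1.0})
```

With equal weights the speech and EEG branches outvote the facial branch, while giving the facial branch a larger weight flips the fused decision. In practice such weights would have to be tuned on validation data; other decision-level rules, such as majority voting or a learned meta-classifier, would fit the same interface.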
