Abstract: Goal: Emotion recognition is an essential human-machine interaction task and has become an active research area over the past decades. Although previous approaches to emotion classification have achieved high performance, several challenges remain open: 1) how to effectively recognize emotions using different modalities; and 2) given the growing computing power that deep learning requires, how to provide real-time detection and improve the robustness of deep neural networks. Method: In this paper, we propose a deep learning-based multimodal emotion recognition (MER) framework called Deep-Emotion, which adaptively integrates the most discriminative features from facial expressions, speech, and electroencephalogram (EEG) signals to improve MER performance. Specifically, the Deep-Emotion framework consists of three branches: a facial branch, a speech branch, and an EEG branch. The facial branch uses an improved GhostNet neural network, proposed in this paper, for feature extraction; compared with the original GhostNet, it effectively alleviates overfitting during training and improves classification accuracy. For the speech branch, we propose a lightweight fully convolutional neural network (LFCNN) for efficient extraction of speech emotion features. For the EEG branch, we propose a tree-like LSTM (tLSTM) model capable of fusing multi-stage features for EEG emotion feature extraction. Finally, we adopt decision-level fusion to integrate the recognition results of the three modalities, yielding more comprehensive and accurate performance. Results and Conclusions: Extensive experiments on the CK+, EMO-DB, and MAHNOB-HCI datasets demonstrate the advanced performance of the proposed Deep-Emotion method, as well as the feasibility and superiority of the MER approach.
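
The decision-level fusion step can be illustrated with a minimal sketch. The abstract does not state which fusion rule Deep-Emotion uses, so the weighted averaging of per-branch class probabilities below is an assumption chosen for illustration, not the authors' method. The function names `fuse_decisions` and `predict`, the branch keys, the three-class setup, and all probability values are hypothetical.

```python
from typing import Dict, List, Optional


def fuse_decisions(
    branch_probs: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Fuse per-branch class probabilities by weighted averaging.

    ASSUMPTION: weighted averaging is an illustrative fusion rule only;
    the source abstract does not specify the rule used by Deep-Emotion.
    """
    if not branch_probs:
        raise ValueError("at least one branch is required")
    names = list(branch_probs)
    n_classes = len(branch_probs[names[0]])
    if any(len(p) != n_classes for p in branch_probs.values()):
        raise ValueError("all branches must score the same set of classes")
    if weights is None:
        # Default to equal trust in every branch.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = [0.0] * n_classes
    for name in names:
        w = weights[name] / total  # normalize so the fused scores sum to 1
        for i, p in enumerate(branch_probs[name]):
            fused[i] += w * p
    return fused


def predict(fused: List[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical softmax outputs of the three branches for one sample.
example = {
    "face": [0.6, 0.3, 0.1],
    "speech": [0.2, 0.5, 0.3],
    "eeg": [0.3, 0.4, 0.3],
}
equal_fused = fuse_decisions(example)
face_heavy = fuse_decisions(example, {"face": 2.0, "speech": 1.0, "eeg": 1.0})
```

With equal weights the fused scores favor the second class, while doubling the weight of the hypothetical facial branch shifts the decision to the first class. This shows how the choice of branch weights can change the final prediction even when each branch's output is unchanged.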

VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

MEC 2016: The Multimodal Emotion Recognition Challenge of CCPR 2016.

Open-vocabulary Multimodal Emotion Recognition: Dataset, Metric, and Benchmark

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Emotion Recognition With Audio, Video, EEG, and EMG: A Dataset and Baseline Approaches

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information.

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition

Multi-head attention fusion networks for multi-modal speech emotion recognition

M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database

Construction and Evaluation of Mandarin Multimodal Emotional Speech Database

MES-P: an Emotional Tonal Speech Dataset in Mandarin Chinese with Distal and Proximal Labels

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

A multimodal emotion recognition model integrating speech, video and MoCAP

Emotion Inferring from Large-scale Internet Voice Data: A Multimodal Deep Learning Approach

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

CHEAVD: a Chinese natural emotional audio–visual database

MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Defective forebrain development in mice lacking gp330/megalin.