Abstract: Goal: Emotion recognition is an essential human-machine interaction task and has been an active research area for decades. Although previous attempts to classify emotions have achieved high performance, several challenges remain open: 1) How to effectively recognize emotions using different modalities. 2) How to provide real-time detection and improve the robustness of deep neural networks, given the increasing amount of computing power that deep learning requires. Method: In this paper, we propose a deep learning-based multimodal emotion recognition (MER) framework called Deep-Emotion, which adaptively integrates the most discriminative features from facial expressions, speech, and electroencephalogram (EEG) signals to improve MER performance. Specifically, the proposed Deep-Emotion framework consists of three branches: the facial branch, the speech branch, and the EEG branch. The facial branch uses the improved GhostNet neural network proposed in this paper for feature extraction, which effectively alleviates overfitting during training and improves classification accuracy compared with the original GhostNet. For the speech branch, this paper proposes a lightweight fully convolutional neural network (LFCNN) for the efficient extraction of speech emotion features. For the EEG branch, we propose a tree-like LSTM (tLSTM) model capable of fusing multi-stage features for EEG emotion feature extraction. Finally, we adopt a decision-level fusion strategy to integrate the recognition results of the three modalities, resulting in more comprehensive and accurate performance. Result and Conclusions: Extensive experiments on the CK+, EMO-DB, and MAHNOB-HCI datasets demonstrate the advantages of the proposed Deep-Emotion method, as well as the feasibility and superiority of the MER approach.
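
The abstract names decision-level fusion but does not state the fusion rule. As a rough illustration only, the sketch below assumes that each branch outputs a probability vector over one shared set of emotion classes and that the vectors are combined by a weighted average. The function names, branch keys, weights, and probabilities are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def fuse_decisions(
    branch_probs: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-branch class probabilities.

    Assumption: every branch predicts over the same ordered label set.
    The weighted-average rule is an illustrative choice, not the
    fusion rule used by Deep-Emotion.
    """
    n_classes = len(next(iter(branch_probs.values())))
    total_weight = sum(weights[name] for name in branch_probs)
    if total_weight <= 0:
        raise ValueError("branch weights must sum to a positive value")

    fused = [0.0] * n_classes
    for name, probs in branch_probs.items():
        if len(probs) != n_classes:
            raise ValueError(f"branch {name!r} has a different number of classes")
        for i, p in enumerate(probs):
            # Normalise the weights so the fused vector still sums to 1.
            fused[i] += weights[name] * p / total_weight
    return fused


def predict(fused: Sequence[float]) -> int:
    """Return the index of the most probable class."""
    return max(range(len(fused)), key=lambda i: fused[i])


# Hypothetical outputs of the three branches for one sample (3 classes).
example_probs = {
    "face": [0.6, 0.3, 0.1],
    "speech": [0.2, 0.5, 0.3],
    "eeg": [0.3, 0.4, 0.3],
}
example_weights = {"face": 1.0, "speech": 1.0, "eeg": 1.0}

fused_probs = fuse_decisions(example_probs, example_weights)
predicted_class = predict(fused_probs)
```

With equal weights, the facial branch alone would pick class 0, but the fused decision is class 1 because the speech and EEG branches both favour it. Fusing at the decision level keeps the three branches independent, so each can be trained and run separately.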

Language-guided Multi-modal Emotional Mimicry Intensity Estimation

Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation

Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction

A Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

Deep Spectrum Feature Representations for Speech Emotion Recognition

Exploring the Power of Cross-Contextual Large Language Model in Mimic Emotion Prediction

Multi-modal Emotion Reaction Intensity Estimation with Temporal Augmentation

Bridging the Emotional Semantic Gap via Multimodal Relevance Estimation

Affective Behaviour Analysis via Integrating Multi-Modal Knowledge

An Effective Ensemble Learning Framework for Affective Behaviour Analysis

A Novel Emotion-Aware Method Based on the Fusion of Textual Description of Speech, Body Movements, and Facial Expressions

Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Integrating Holistic and Local Information to Estimate Emotional Reaction Intensity

Mutilmodal Feature Extraction and Attention-based Fusion for Emotion Estimation in Videos

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Emotional Reaction Intensity Estimation Based on Multimodal Data

EffMulti: Efficiently Modeling Complex Multimodal Interactions for Emotion Analysis

MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG