Abstract: Goal: Emotion recognition is an essential task in human-machine interaction and has been an active research area for decades. Although previous approaches to emotion classification have achieved high performance, several challenges remain open: 1) how to recognize emotions effectively from different modalities, and 2) how to provide real-time detection and improve the robustness of deep neural networks, given the increasing computing power that deep learning requires. Method: In this paper, we propose a deep learning-based multimodal emotion recognition (MER) method called Deep-Emotion, which adaptively integrates the most discriminative features from facial expressions, speech, and electroencephalogram (EEG) signals to improve MER performance. Specifically, the proposed Deep-Emotion framework consists of three branches: the facial branch, the speech branch, and the EEG branch. The facial branch uses an improved GhostNet neural network proposed in this paper for feature extraction, which effectively alleviates overfitting during training and improves classification accuracy compared with the original GhostNet. For the speech branch, we propose a lightweight fully convolutional neural network (LFCNN) for efficient extraction of speech emotion features. For the EEG branch, we propose a tree-like LSTM (tLSTM) model capable of fusing multi-stage features for EEG emotion feature extraction. Finally, we adopt a decision-level fusion strategy to integrate the recognition results of the three modalities, resulting in more comprehensive and accurate performance. Results and Conclusions: Extensive experiments on the CK+, EMO-DB, and MAHNOB-HCI datasets demonstrate the advanced nature of the proposed Deep-Emotion method, as well as the feasibility and superiority of the MER approach.
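
The decision-level fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which fusion rule Deep-Emotion uses, so the weighted averaging of per-branch class probabilities below is an assumption, and the names `fuse_decisions`, `predict`, and the example probabilities are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def fuse_decisions(
    branch_probs: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Combine per-branch class probabilities by a weighted average.

    Assumption: each branch (e.g. face, speech, EEG) has already produced
    a probability vector over the same ordered set of emotion classes.
    """
    if not branch_probs:
        raise ValueError("at least one branch is required")
    names = list(branch_probs)
    n_classes = len(branch_probs[names[0]])
    if any(len(p) != n_classes for p in branch_probs.values()):
        raise ValueError("all branches must score the same classes")
    if weights is None:
        # Equal trust in every branch when no weights are given.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = [0.0] * n_classes
    for name in names:
        w = weights[name] / total  # normalize so the output sums to 1
        for i, p in enumerate(branch_probs[name]):
            fused[i] += w * p
    return fused


def predict(fused: List[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Hypothetical outputs of the three branches over three classes.
    probs = {
        "face": [0.6, 0.3, 0.1],
        "speech": [0.2, 0.5, 0.3],
        "eeg": [0.3, 0.4, 0.3],
    }
    fused = fuse_decisions(probs)
    print(fused, predict(fused))
```

Because fusion here operates only on each branch's final scores, the three feature extractors can be trained and replaced independently; other decision-level rules, such as majority voting or learned weights, would fit the same interface.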

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG
