Abstract:
Introduction: Multimodal emotion recognition has become an active research topic in the fields of human-computer interaction and intelligent healthcare. However, combining information from different human modalities for emotion computation remains challenging.
Methods: In this paper, we propose a three-dimensional convolutional recurrent neural network model (referred to as the 3FACRNN network) based on multimodal fusion and an attention mechanism. The 3FACRNN network consists of a visual network and an EEG network. The visual network is a cascaded convolutional neural network and temporal convolutional network (CNN-TCN). In the EEG network, a 3D feature building module integrates the band, spatial, and temporal information of the EEG signal, and band attention and self-attention modules are added to the convolutional recurrent neural network (CRNN). The band attention module explores the effect of different frequency bands on recognition performance, while the self-attention module captures the intrinsic similarity of different EEG samples.
Results: To investigate the effect of different frequency bands, we computed the average attention mask over all subjects for each frequency band. The distribution of the attention masks across bands suggests that signals more relevant to human emotions may be active in the high-frequency γ band (31–50 Hz). Finally, we use the multi-task loss function Lc to force the intermediate feature vectors of the visual and EEG modalities to approximate each other, with the aim of using knowledge from the visual modality to improve the performance of the EEG network. On the two multimodal emotion datasets, the mean recognition accuracy and standard deviation of the proposed method were 96.75 ± 1.75% (arousal) and 96.86 ± 1.33% (valence) on DEAP, and 97.55 ± 1.51% (arousal) and 98.37 ± 1.07% (valence) on MAHNOB-HCI, better than those of state-of-the-art multimodal recognition approaches.
Discussion: The experimental results show that using both the subjects' facial video frames and electroencephalogram (EEG) signals as inputs to the emotion recognition network enhances the network's stability and improves its recognition accuracy. In future work, we will try to use sparse matrix methods and deep convolutional networks to further improve the performance of multimodal emotion networks.
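
The abstract names two mechanisms without giving their form: a band attention mask over EEG frequency bands, and a multi-task loss Lc that pulls the intermediate EEG and visual feature vectors together. The following is a minimal sketch of how such components might look, not the authors' implementation. The softmax mask, the mean-squared-error form of the alignment term, the weight `alpha`, and all function names are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def band_attention(band_features, band_scores):
    """Scale each band's feature vector by an attention mask.

    Assumption: the mask is a softmax over one scalar score per band.
    Returns the mask and the re-weighted band features.
    """
    mask = softmax(band_scores)
    weighted = [[w * x for x in feats] for w, feats in zip(mask, band_features)]
    return mask, weighted


def alignment_loss(eeg_feat, visual_feat):
    """Assumed form of the alignment term: mean squared difference
    between the intermediate EEG and visual feature vectors."""
    if len(eeg_feat) != len(visual_feat):
        raise ValueError("feature vectors must have the same length")
    squared = [(e - v) ** 2 for e, v in zip(eeg_feat, visual_feat)]
    return sum(squared) / len(squared)


def multitask_loss(class_probs, label, eeg_feat, visual_feat, alpha=0.5):
    """Classification cross-entropy plus a weighted alignment term.

    `alpha` is a hypothetical trade-off weight, not a value from the paper.
    """
    cross_entropy = -math.log(class_probs[label])
    return cross_entropy + alpha * alignment_loss(eeg_feat, visual_feat)


if __name__ == "__main__":
    # Two toy bands with equal scores receive equal attention.
    mask, weighted = band_attention([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
    print(mask, weighted)
    print(multitask_loss([0.5, 0.5], 0, [1.0, 2.0], [1.0, 4.0]))
```

In this sketch, minimizing the alignment term moves the EEG features toward the visual features, which is one plausible reading of "forcing the approximation of the intermediate feature vectors"; averaging the mask over subjects would then give a per-band relevance profile of the kind reported in the Results.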

Multimodal Emotion Recognition Using Deep Generalized Canonical Correlation Analysis with an Attention Mechanism

Multimodal Emotion Recognition Using Deep Canonical Correlation Analysis

An Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

Comparing Recognition Performance and Robustness of Multimodal Deep Learning Models for Multimodal Emotion Recognition

Multi-mode Emotion Recognition Based on Generalized Discriminative Canonical Correlation Analysis

Feature Fusion for Multimodal Emotion Recognition Based on Deep Canonical Correlation Analysis

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

Dense Graph Convolutional with Joint Cross-Attention Network for Multimodal Emotion Recognition

Emotion Recognition From Multimodal Physiological Signals via Discriminative Correlation Fusion With a Temporal Alignment Mechanism

Multi-modal fusion network with complementarity and importance for emotion recognition

K-Means Clustering-based Kernel Canonical Correlation Analysis for Multimodal Emotion Recognition

An Improved Multimodal Dimension Emotion Recognition Based on Different Fusion Methods

Multi-Modality Emotion Recognition Model with GAT-Based Multi-Head Inter-Modality Attention

Attention-based 3D convolutional recurrent neural network model for multimodal emotion recognition

Multimodal Emotion Recognition Using a Modified Dense Co-Attention Symmetric Network

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information

Multimodal emotion recognition from facial expression and speech based on feature fusion

A Dual Attention-based Modality-Collaborative Fusion Network for Emotion Recognition

Multi-channel Weight-sharing Autoencoder Based on Cascade Multi-head Attention for Multimodal Emotion Recognition

Multimodal Emotion Recognition Based on Feature Selection and Extreme Learning Machine in Video Clips

Multimodal Emotion Recognition Using Multimodal Deep Learning