Abstract: Continuous emotion recognition has been a compelling topic in affective computing because it can interpret human emotions subtly and continuously. Existing studies have achieved advanced emotion recognition performance using multimodal knowledge. However, these studies generally ignore cases where particular modalities are missing in the inference phase, and are therefore sensitive to the absence of modalities. To resolve this issue, we propose a novel multimodal shared network with a cross-modal distribution constraint, i.e., the DS-Net, which aims to improve the robustness of the model to missing modalities. The training process of the proposed network includes two components: multimodal shared space modeling and a cross-modal distribution matching constraint. The former utilizes the local and temporal information of multimodal signals to model the multimodal shared space, while the latter further enhances that shared space via a loose constraint method. Coupled with the latter, the former can effectively exploit the complementarity between videos and peripheral physiological signals (PPSs), thus enhancing the discriminative capability of the shared space. Based on the shared space, the DS-Net works during the inference phase with only one modality as input and can still leverage multimodal knowledge to improve emotion recognition accuracy. Comprehensive experiments were conducted on two public datasets. Results demonstrate that the proposed method is competitive with or superior to current state-of-the-art methods. Further experiments indicate that the proposed method can be extended to handle other modalities and to deal with partially missing modalities, demonstrating its potential in real-world applications.
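To make the idea of a cross-modal distribution matching constraint concrete, the following is a minimal sketch only. The abstract does not state which distribution measure the DS-Net uses, so the maximum mean discrepancy (MMD) with an RBF kernel is swapped in here purely as an assumption. The names `rbf_kernel`, `mmd2`, `video_emb`, and `pps_emb` are hypothetical and do not come from the paper.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances, clipped at 0 for numerical safety.
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y.

    ASSUMPTION: MMD stands in for the paper's (unspecified) distribution
    constraint; it is small when the two embedding sets are similarly
    distributed and grows as they drift apart.
    """
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical shared-space embeddings: 64 samples, 8 dimensions each.
rng = np.random.default_rng(0)
video_emb = rng.normal(0.0, 1.0, size=(64, 8))
pps_emb = rng.normal(0.0, 1.0, size=(64, 8))
print(mmd2(video_emb, pps_emb, sigma=2.0))
```

In a training loop, a term of this kind would typically be added to the regression loss with a small weight, so that the video and PPS embeddings are pulled toward a common distribution without being forced to coincide sample by sample. That weighting is one possible reading of the "loose constraint" described above, not a detail confirmed by the abstract.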

Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

A Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks

Long Short Term Memory Recurrent Neural Network Based Multimodal Dimensional Emotion Recognition

Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network and Self-Attention Mechanism

Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction

Multimodal Continuous Prediction of Emotions in Movies using Long Short-Term Memory Networks

An Improved Multimodal Dimension Emotion Recognition Based on Different Fusion Methods

Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Multimodal Transformer Fusion for Continuous Emotion Recognition

A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos

End-to-End Continuous Emotion Recognition from Video Using 3D Convlstm Networks

Multimodal Sentiment Analysis Based on Recurrent Neural Network and Multimodal Attention

Multi-resolution modulation-filtered cochleagram feature for LSTM-based dimensional emotion recognition from speech

Time-Delay Neural Network for Continuous Emotional Dimension Prediction from Facial Expression Sequences

Residual multimodal Transformer for expression‐EEG fusion continuous emotion recognition

Continuous Emotion Recognition with Audio-visual Leader-follower Attentive Fusion

A multimodal shared network with a cross-modal distribution constraint for continuous emotion recognition

Multitask Learning and Multistage Fusion for Dimensional Audiovisual Emotion Recognition