Abstract: The inherent dependencies between the audio and visual modalities extracted from video content, together with well-established film grammar (i.e., domain knowledge), are important for emotion recognition and regression from videos. However, these cues have yet to be exploited successfully. Therefore, we propose a multimodal deep regression Bayesian network (MMDRBN) to capture the relationship between audio and visual modalities for emotion video tagging. We then modify the structure of the MMDRBN to incorporate domain knowledge. A regression Bayesian network (RBN) is formed from one latent layer, one visible layer, and directed links from the latent layer to the visible layer. An RBN is able to fully represent the data, since it captures the dependencies not only among the visible variables but also among the latent variables given the visible variables. For the MMDRBN, we first learn several layers of RBNs for the audio and visual modalities, and then stack these RBNs to form two deep networks. A joint representation is obtained from the top layers of the two deep networks, capturing the deep dependencies between audio and visual modalities. We also summarize the main audio and visual elements used by filmmakers to convey emotions and formulate them as semantically meaningful middle-level representations, i.e., attributes. Through these attributes, we construct the knowledge-augmented MMDRBN, which learns a hybrid middle-level video representation from video data and the summarized attributes. Experimental results for both emotion recognition and regression from videos on the LIRIS-ACCEDE database demonstrate that the proposed model can successfully capture the intrinsic connections between audio and visual modalities, and can integrate the middle-level representation learned from video data with the semantic attributes summarized from film grammar. Thus, it achieves superior performance on emotion video tagging compared to state-of-the-art methods.
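
The claim that an RBN captures dependencies among latent variables given the visible variables can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes binary latent and visible units with sigmoid conditionals, and the names `latent_posterior`, `W`, `b`, and `d` are hypothetical. It enumerates every latent configuration of a tiny directed latent-to-visible model and checks that the exact posterior over two latent units does not factorize once the visible unit is observed.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def latent_posterior(x, W, b, d):
    """Exact p(h | x) for a toy directed latent-to-visible model.

    Assumed parameterization (not taken from the paper):
      p(h_j = 1)     = sigmoid(d_j)             independent latent priors
      p(x_i = 1 | h) = sigmoid(W[i] @ h + b_i)  directed links h -> x
    """
    n_latent = W.shape[1]
    # All 2**n_latent binary latent configurations, one per row.
    configs = np.array(list(itertools.product([0, 1], repeat=n_latent)), dtype=float)
    p_h1 = sigmoid(d)
    prior = np.prod(np.where(configs == 1, p_h1, 1 - p_h1), axis=1)
    p_x1 = sigmoid(configs @ W.T + b)  # shape (n_configs, n_visible)
    likelihood = np.prod(np.where(x == 1, p_x1, 1 - p_x1), axis=1)
    joint = prior * likelihood
    return configs, joint / joint.sum()

# One visible unit with two latent parents.
W = np.array([[4.0, 4.0]])
b = np.array([-2.0])
d = np.array([0.0, 0.0])
x = np.array([1.0])

configs, post = latent_posterior(x, W, b, d)
p_h1 = post[configs[:, 0] == 1].sum()
p_h2 = post[configs[:, 1] == 1].sum()
p_both = post[(configs[:, 0] == 1) & (configs[:, 1] == 1)].sum()

# A nonzero gap means the latent units are dependent given x.
dependence_gap = p_both - p_h1 * p_h2
print(f"p(h1=1|x)={p_h1:.3f}  p(h2=1|x)={p_h2:.3f}  gap={dependence_gap:.3f}")
```

Exact enumeration is only feasible for a handful of latent units; it is used here purely to make the posterior dependency visible, not as a practical inference method.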

Multimodal Information-Based Broad and Deep Learning Model for Emotion Understanding

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Coupled Multimodal Emotional Feature Analysis Based on Broad-Deep Fusion Networks in Human-Robot Interaction

Emotion recognition using multimodal deep learning in multiple psychophysiological signals and video

Convolutional Features-Based Broad Learning with LSTM for Multidimensional Facial Emotion Recognition in Human–Robot Interaction

Multimodal Emotional Classification Based on Meaningful Learning

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

Speech Expression Multimodal Emotion Recognition Based on Deep Belief Network

A Novel Supervised Bimodal Emotion Recognition Approach Based on Facial Expression and Body Gesture

Multimodal Emotion Recognition From EEG Signals and Facial Expressions

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

A multimodal emotion recognition model integrating speech, video and MoCAP

Multimodal Emotion Recognition Using Multimodal Deep Learning

A Novel Emotion-Aware Method Based on the Fusion of Textual Description of Speech, Body Movements, and Facial Expressions

Multimodal Fused Emotion Recognition about Expression-EEG Interaction and Collaboration Using Deep Learning

Multimodal Emotion Recognition Using Deep Neural Networks

Multimodal Emotion Recognition Based on Feature Selection and Extreme Learning Machine in Video Clips

Emotion recognition using heterogeneous convolutional neural networks combined with multimodal factorized bilinear pooling

Knowledge-Augmented Multimodal Deep Regression Bayesian Networks for Emotion Video Tagging

Multimodal Emotion Recognition with Factorized Bilinear Pooling and Adversarial Learning

Expression EEG Multimodal Emotion Recognition Method Based on the Bidirectional LSTM and Attention Mechanism