Abstract: The inherent dependencies between the audio and visual modalities extracted from video content, together with well-established film grammar (i.e., domain knowledge), are important for video emotion recognition and regression. However, neither has yet been exploited successfully. Therefore, we propose a multimodal deep regression Bayesian network (MMDRBN) to capture the relationship between audio and visual modalities for emotion video tagging. We then modify the structure of the MMDRBN to incorporate domain knowledge. A regression Bayesian network (RBN) consists of one latent layer, one visible layer, and directed links from the latent layer to the visible layer. An RBN can fully represent the data, since it captures the dependencies not only among the visible variables but also among the latent variables given the visible variables. For the MMDRBN, we first learn several layers of RBNs from the audio and visual modalities, and then stack these RBNs to form two deep networks. A joint representation is obtained from the top layers of the two deep networks, capturing the deep dependencies between audio and visual modalities. We also summarize the main audio and visual elements used by filmmakers to convey emotions and formulate them as semantically meaningful middle-level representations, i.e., attributes. Through these attributes, we construct the knowledge-augmented MMDRBN, which learns a hybrid middle-level video representation from video data and the summarized attributes. Experimental results for both emotion recognition and regression from videos on the LIRIS-ACCEDE database demonstrate that the proposed model can successfully capture the intrinsic connections between audio and visual modalities, and can integrate middle-level representation learning from video data with the semantic attributes summarized from film grammar. Thus, it achieves superior performance on emotion video tagging compared to state-of-the-art methods.
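As a rough illustration of the structure described in the abstract, the following minimal sketch shows a single RBN as a directed latent-to-visible model, plus a joint layer placed on top of an audio encoder and a visual encoder. Everything concrete here is an assumption rather than the authors' implementation: the Bernoulli latent prior, the linear-Gaussian visible units, the layer sizes, the function names `sample_rbn`, `encode`, and `joint_representation`, and in particular the feed-forward sigmoid pass, which is only a crude stand-in for posterior inference and ignores the dependencies among latent variables given the visible variables that the abstract highlights.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sample_rbn(W, b, d, n_samples, sigma=1.0, rng=None):
    """Ancestral sampling from a one-layer RBN (assumed parameterization).

    W: (n_visible, n_latent) weights on the directed latent-to-visible links
    b: (n_visible,) visible biases
    d: (n_latent,) latent biases
    """
    rng = np.random.default_rng() if rng is None else rng
    # Latent layer: independent Bernoulli units with p(h_j = 1) = sigmoid(d_j).
    h = (rng.random((n_samples, d.shape[0])) < sigmoid(d)).astype(float)
    # Visible layer: each unit is a noisy linear regression on the latent units.
    noise = sigma * rng.standard_normal((n_samples, b.shape[0]))
    x = h @ W.T + b + noise
    return h, x


def encode(x, W, d):
    """Feed-forward approximation of p(h = 1 | x); NOT exact RBN inference."""
    return sigmoid(x @ W + d)


def joint_representation(x_audio, x_visual, audio_layers, visual_layers, joint_layer):
    """Pass each modality through its own stack, then fuse the two top layers.

    audio_layers / visual_layers: lists of (W, d) pairs, bottom to top.
    joint_layer: a single (W, d) pair over the concatenated top activations.
    """
    a, v = x_audio, x_visual
    for W, d in audio_layers:
        a = encode(a, W, d)
    for W, d in visual_layers:
        v = encode(v, W, d)
    W_j, d_j = joint_layer
    return encode(np.concatenate([a, v], axis=1), W_j, d_j)
```

In this sketch the joint layer sees only the top-layer activations of each modality, which mirrors the description of a joint representation obtained from the top layers of the two deep networks. Learning the weights is not shown.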

Multimodal Knowledge Expansion Supplementary Materials

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English

4DME: A Spontaneous 4D Micro-Expression Dataset with Multimodalities

Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning

A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset

Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information.

Multimodal emotion recognition using SDA-LDA algorithm in video clips

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

Versatile audio-visual learning for emotion recognition

Multimodal interaction enhanced representation learning for video emotion recognition

Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition

ADAM optimised human speech emotion recogniser based on statistical information distribution of chroma, MFCC, and MBSE features

Robust Audiovisual Emotion Recognition: Aligning Modalities, Capturing Temporal Information, and Handling Missing Features

Knowledge-Augmented Multimodal Deep Regression Bayesian Networks for Emotion Video Tagging

Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks

Whose Emotion Matters? Speaking Activity Localisation without Prior Knowledge

MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation

Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages