Abstract: In the last few years, multimodal emotion recognition has become an important research topic in the affective computing community due to its wide range of applications, which include mental disease diagnosis, human behavior understanding, human-machine/robot interaction, and autonomous driving systems. In this paper, we introduce a novel end-to-end multimodal emotion recognition methodology based on audio-visual fusion and designed to leverage the mutually complementary nature of the features while maintaining the modality-specific information. The proposed method integrates spatial, channel and temporal attention mechanisms into a visual 3D convolutional neural network (3D-CNN) and temporal attention into an audio 2D convolutional neural network (2D-CNN) to capture the intra-modal feature characteristics. Further, the inter-modal information is captured with the help of an audio-video (A-V) cross-attention fusion technique that effectively identifies salient relationships across the two modalities. Finally, by considering the semantic relations between the emotion categories, we design a novel classification loss based on an emotional metric constraint that guides the attention generation mechanisms. We demonstrate that, by exploiting the relations between the emotion categories, our method yields more discriminative embeddings, with more compact intra-class representations and increased inter-class separability. The experimental evaluation carried out on the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) and CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) datasets validates the proposed methodology, which achieves average accuracy scores of 89.25% and 84.57%, respectively. In addition, when compared to state-of-the-art techniques, the proposed solution shows superior performance, with accuracy gains ranging from 1.72% to 11.25%.
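The audio-video cross-attention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes plain scaled dot-product attention with no learned projections, random toy features standing in for the 2D-CNN and 3D-CNN outputs, and a residual sum as one simple way of keeping modality-specific information. All function names (`cross_attention`, `fuse`, `mean_pool`) are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query vector attends over keys_values (scaled dot-product).

    Assumption: keys and values are the same vectors and no learned
    projection matrices are used, to keep the sketch short.
    """
    d = len(queries[0])
    attended, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        w = softmax(scores)
        weights.append(w)
        attended.append([sum(w[j] * keys_values[j][i]
                             for j in range(len(keys_values)))
                         for i in range(d)])
    return attended, weights


def mean_pool(seq):
    """Average a sequence of vectors over time."""
    n = len(seq)
    return [sum(v[i] for v in seq) / n for i in range(len(seq[0]))]


def fuse(audio, video):
    """Bidirectional cross-attention followed by concatenation."""
    a2v, _ = cross_attention(audio, video)  # audio frames query video frames
    v2a, _ = cross_attention(video, audio)  # video frames query audio frames
    # Residual sum: original features plus cross-modal context.
    audio_out = [[x + y for x, y in zip(a, c)] for a, c in zip(audio, a2v)]
    video_out = [[x + y for x, y in zip(v, c)] for v, c in zip(video, v2a)]
    return mean_pool(audio_out) + mean_pool(video_out)


random.seed(0)
# Toy inputs: 5 audio frames and 3 video frames, 4 features each.
audio = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
video = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
fused = fuse(audio, video)
_, attn = cross_attention(audio, video)
```

Here `fused` is a single vector of length 8 (4 audio plus 4 video features) that a classifier could consume, and each row of `attn` is a probability distribution over the video frames. In a trained model the queries, keys and values would normally pass through learned linear layers first.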

Going Beyond Closed Sets: A Multimodal Perspective for Video Emotion Analysis

Emotion Recognition in Videos via Fusing Multimodal Features

Bridging the Emotional Semantic Gap via Multimodal Relevance Estimation

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Multi-modal emotion analysis from facial expressions and electroencephalogram

Multimodal interaction enhanced representation learning for video emotion recognition

Bridging Discrete and Continuous: A Multimodal Strategy for Complex Emotion Detection

Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

Emotional Video Captioning With Vision-Based Emotion Interpretation Network

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

MSEVA: A System for Multimodal Short Videos Emotion Visual Analysis

Toward Multimodal Modeling of Emotional Expressiveness

MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues

Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model

Image-Text Multimodal Emotion Classification via Multi-View Attentional Network

Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information

Multimodal Emotion Recognition by Fusing Video Semantic in MOOC Learning Scenarios

MEmoR: A Dataset for Multimodal Emotion Reasoning in Videos

Multi-View Common Space Learning For Emotion Recognition In The Wild