Abstract: In recent years, multimodal emotion recognition has become an important research topic in the affective computing community because of its wide range of applications, which include mental disease diagnosis, human behavior understanding, human-machine/robot interaction, and autonomous driving systems. In this paper, we introduce a novel end-to-end multimodal emotion recognition methodology based on audio and visual fusion, designed to leverage the mutually complementary nature of the features while preserving modality-specific information. The proposed method integrates spatial, channel and temporal attention mechanisms into a visual 3D convolutional neural network (3D-CNN) and temporal attention into an audio 2D convolutional neural network (2D-CNN) to capture intra-modal feature characteristics. Inter-modal information is then captured by an audio-video (A-V) cross-attention fusion technique that identifies salient relationships across the two modalities. Finally, by considering the semantic relations between the emotion categories, we design a novel classification loss based on an emotional metric constraint that guides the attention generation mechanisms. We demonstrate that, by exploiting the relations between the emotion categories, our method yields more discriminative embeddings, with more compact intra-class representations and increased inter-class separability. The experimental evaluation carried out on the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) and CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) datasets validates the proposed methodology, which achieves average accuracy scores of 89.25% and 84.57%, respectively. In addition, when compared with state-of-the-art techniques, the proposed solution shows superior performance, with accuracy gains ranging from 1.72% to 11.25%.
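To make the idea of audio-video cross-attention fusion concrete, the following is a minimal sketch in plain Python and not the architecture proposed in the paper. It assumes that each modality has already been encoded into a short sequence of feature vectors of equal dimension, omits all learned projections, and uses a residual sum followed by mean pooling and concatenation as one plausible way to keep modality-specific information. The names `cross_attend` and `fuse` and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention where one modality queries the other.

    Assumption: no learned query/key/value projections, so the raw
    feature vectors play all three roles.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                         for j in range(dim)])
    return attended


def mean_pool(sequence):
    """Average a sequence of vectors over the time axis."""
    n = len(sequence)
    return [sum(column) / n for column in zip(*sequence)]


def fuse(audio, video):
    """Fuse two feature sequences with bidirectional cross-attention."""
    audio_ctx = cross_attend(audio, video)  # audio queries attend to video
    video_ctx = cross_attend(video, audio)  # video queries attend to audio
    # Residual sum: keep each modality's own features next to the
    # cross-modal context (an assumption, not the paper's exact design).
    audio_out = [[a + c for a, c in zip(row, ctx)]
                 for row, ctx in zip(audio, audio_ctx)]
    video_out = [[v + c for v, c in zip(row, ctx)]
                 for row, ctx in zip(video, video_ctx)]
    # Pool over time and concatenate into one joint embedding.
    return mean_pool(audio_out) + mean_pool(video_out)


# Toy inputs: 3 audio frames and 2 video frames, 4 features each.
audio = [[0.1, 0.3, -0.2, 0.5], [0.4, -0.1, 0.2, 0.0], [0.0, 0.2, 0.1, -0.3]]
video = [[0.2, 0.1, 0.0, 0.4], [-0.3, 0.5, 0.1, 0.2]]
embedding = fuse(audio, video)
```

The joint embedding has twice the per-modality feature dimension and could be passed to a classifier. In a trained model the attention would operate on learned projections of the 3D-CNN and 2D-CNN outputs rather than on raw features.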
