Abstract: In the last few years, multimodal emotion recognition has become an important research topic in the affective computing community due to its wide range of applications, which include mental disease diagnosis, human behavior understanding, human-machine/robot interaction, and autonomous driving systems. In this paper, we introduce a novel end-to-end multimodal emotion recognition methodology based on audio and visual fusion, designed to leverage the mutually complementary nature of the features while maintaining the modality-specific information. The proposed method integrates spatial, channel, and temporal attention mechanisms into a visual 3D convolutional neural network (3D-CNN) and temporal attention into an audio 2D convolutional neural network (2D-CNN) to capture the intra-modal feature characteristics. Further, the inter-modal information is captured with an audio-video (A-V) cross-attention fusion technique that effectively identifies salient relationships across the two modalities. Finally, by considering the semantic relations between the emotion categories, we design a novel classification loss based on an emotional metric constraint that guides the attention generation mechanisms. We demonstrate that, by exploiting the relations between the emotion categories, our method yields more discriminative embeddings, with more compact intra-class representations and increased inter-class separability. The experimental evaluation carried out on the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) and CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) datasets validates the proposed methodology, which achieves average accuracy scores of 89.25% and 84.57%, respectively. In addition, when compared to state-of-the-art techniques, the proposed solution shows superior performance, with accuracy gains ranging from 1.72% to 11.25%.
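The abstract does not give the details of the A-V cross-attention fusion, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes plain scaled dot-product attention without learned projections, toy 2-dimensional features, and a fusion step that concatenates the mean-pooled original and attended features so that modality-specific information is retained. The names `cross_attend` and `fuse` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Each query vector (one modality) attends over the other modality.

    Assumption: no learned query/key/value projections; the raw features
    act as queries, keys, and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the other modality's vectors.
        attended = [sum(w * k[j] for w, k in zip(weights, keys_values))
                    for j in range(d)]
        out.append(attended)
    return out

def mean_pool(seq):
    """Average a sequence of vectors over the time axis."""
    n = len(seq)
    return [sum(v[j] for v in seq) / n for j in range(len(seq[0]))]

def fuse(audio, video):
    """Concatenate original and cross-attended summaries of both modalities."""
    audio_from_video = cross_attend(audio, video)
    video_from_audio = cross_attend(video, audio)
    return (mean_pool(audio) + mean_pool(audio_from_video)
            + mean_pool(video) + mean_pool(video_from_audio))

# Toy inputs: 3 audio frames and 2 video frames, feature size 2 (illustrative).
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[0.5, 0.5], [1.0, 0.0]]
fused = fuse(audio, video)
print(len(fused))
```

In a trained model, the queries, keys, and values would typically come from learned linear projections of the 2D-CNN and 3D-CNN feature maps, and the fused vector would feed the classifier.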

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning
