Abstract: In recent years, multimodal emotion recognition has become an important research topic in the affective computing community because of its wide range of applications, including mental disease diagnosis, human behavior understanding, human-machine/robot interaction, and autonomous driving systems. In this paper, we introduce a novel end-to-end multimodal emotion recognition methodology based on audio and visual fusion, designed to leverage the mutually complementary nature of the features while preserving modality-specific information. The proposed method integrates spatial, channel, and temporal attention mechanisms into a visual 3D convolutional neural network (3D-CNN), and temporal attention into an audio 2D convolutional neural network (2D-CNN), to capture intra-modal feature characteristics. The inter-modal information is then captured by an audio-video (A-V) cross-attention fusion technique that identifies salient relationships across the two modalities. Finally, by considering the semantic relations between the emotion categories, we design a novel classification loss based on an emotional metric constraint that guides the attention generation mechanisms. We demonstrate that, by exploiting the relations between the emotion categories, our method yields more discriminative embeddings, with more compact intra-class representations and increased inter-class separability. The experimental evaluation, carried out on the RAVDESS (The Ryerson Audio-Visual Database of Emotional Speech and Song) and CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) datasets, validates the proposed methodology, which achieves average accuracy scores of 89.25% and 84.57%, respectively. Compared with state-of-the-art techniques, the proposed solution performs better, with accuracy gains ranging from 1.72% to 11.25%.
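To make the idea of A-V cross-attention fusion more concrete, the following is a minimal sketch of one common way such a step can be written. It is not the authors' implementation: the abstract does not specify the attention form, so plain scaled dot-product attention without learned projections is assumed here, and the residual connection, mean pooling, and concatenation are illustrative choices. The names `cross_attend`, `mean_pool`, and `fuse` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Each query time step attends over all context time steps.

    Assumption: scaled dot-product attention with no learned projections.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[j] for w, k in zip(weights, context))
                    for j in range(d)]
        # Residual connection (assumed) keeps the query modality's own features.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out

def mean_pool(seq):
    """Average a sequence of feature vectors over time."""
    n = len(seq)
    return [sum(step[j] for step in seq) / n for j in range(len(seq[0]))]

def fuse(audio, video):
    """Cross-attend in both directions, pool over time, then concatenate."""
    audio_ctx = cross_attend(audio, video)   # audio queries, video context
    video_ctx = cross_attend(video, audio)   # video queries, audio context
    return mean_pool(audio_ctx) + mean_pool(video_ctx)

# Toy features: 3 audio steps and 2 video steps, each of dimension 2.
audio = [[0.1, 0.3], [0.5, 0.2], [0.0, 0.4]]
video = [[0.2, 0.1], [0.4, 0.6]]
fused = fuse(audio, video)
```

In this sketch the fused vector has twice the per-modality feature dimension, since the two pooled streams are concatenated; a classifier head would then operate on it.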

A Crossmodal Approach to Multimodal Fusion in Video Hyperlinking

Cross-modal Embeddings for Video and Audio Retrieval

Generative Adversarial Networks for Multimodal Representation Learning in Video Hyperlinking

Exploiting Multimodality in Video Hyperlinking to Improve Target Diversity

Dynamic Multimodal Fusion in Video Search

CMCI: A Robust Multimodal Fusion Method for Spiking Neural Networks

Fusion of Multimodal Embeddings for Ad-Hoc Video Search

Attention-Based Multimodal Fusion for Video Description

Neural Dependency Coding inspired Multimodal Fusion

Everything is a Video: Unifying Modalities through Next-Frame Prediction

Multimodal Fusion for Video Search Reranking

Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation

Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval

Concept-Driven Multi-Modality Fusion for Video Search

Joint embeddings with multimodal cues for video-text retrieval

Applying recent advances in Visual Question Answering to Record Linkage

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

Cross-modal Search Method of Technology Video based on Adversarial Learning and Feature Fusion