Abstract: In recent years, multimodal emotion recognition has become an important research topic in the affective computing community because of its wide range of applications, which include mental disease diagnosis, human behavior understanding, human-machine/robot interaction, and autonomous driving systems. In this paper, we introduce a novel end-to-end multimodal emotion recognition methodology based on audio and visual fusion, designed to leverage the mutually complementary nature of the features while maintaining modality-specific information. The proposed method integrates spatial, channel, and temporal attention mechanisms into a visual 3D convolutional neural network (3D-CNN) and temporal attention into an audio 2D convolutional neural network (2D-CNN) to capture intra-modal feature characteristics. Further, the inter-modal information is captured with an audio-video (A-V) cross-attention fusion technique that identifies salient relationships across the two modalities. Finally, by considering the semantic relations between the emotion categories, we design a novel classification loss based on an emotional metric constraint that guides the attention generation mechanisms. We demonstrate that, by exploiting the relations between the emotion categories, our method yields more discriminative embeddings, with more compact intra-class representations and increased inter-class separability. The experimental evaluation carried out on the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) and CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) datasets validates the proposed methodology, which achieves average accuracy scores of 89.25% and 84.57%, respectively. In addition, the proposed solution outperforms state-of-the-art techniques, with accuracy gains ranging from 1.72% to 11.25%.
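The A-V cross-attention fusion mentioned in the abstract can be illustrated with a minimal sketch. This is generic scaled dot-product cross-attention, not the authors' architecture: it has no learned projections, the feature values are toy numbers, and the names `cross_attention` and `fuse` as well as the residual-add and mean-pool steps are assumptions made purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query vector (one modality) attends over all vectors of the
    other modality, which serve as both keys and values (assumption:
    no learned projection matrices)."""
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[j] for w, k in zip(weights, keys_values)) for j in range(d)]
        )
    return attended


def fuse(audio, video):
    """Hypothetical fusion: cross-attend in both directions, add the result
    to the original features so modality-specific information is kept,
    mean-pool over time, and concatenate the two pooled vectors."""
    def pool(original, attended):
        steps, d = len(original), len(original[0])
        return [
            sum(original[t][j] + attended[t][j] for t in range(steps)) / steps
            for j in range(d)
        ]

    audio_to_video = cross_attention(audio, video)  # audio queries, video keys
    video_to_audio = cross_attention(video, audio)  # video queries, audio keys
    return pool(audio, audio_to_video) + pool(video, video_to_audio)


# Toy features: 3 audio time steps and 2 video time steps, 4 dimensions each.
audio = [[0.1, 0.3, -0.2, 0.5], [0.0, 0.2, 0.4, -0.1], [0.3, -0.3, 0.1, 0.2]]
video = [[0.2, 0.1, 0.0, 0.4], [-0.1, 0.5, 0.3, 0.0]]
fused = fuse(audio, video)
print(len(fused))  # length of the fused embedding
```

In a trained model, the queries, keys, and values would pass through learned linear projections, and the fused embedding would feed a classifier trained with the metric-constrained loss described above; neither part is reproduced here.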

CCMA: CapsNet for audio–video sentiment analysis using cross-modal attention

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention

Video-Based Cross-Modal Auxiliary Network for Multimodal Sentiment Analysis

Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism

CMCI: A Robust Multimodal Fusion Method for Spiking Neural Networks

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis

Exploring Multimodal Sentiment Analysis via CBAM Attention and Double-layer BiLSTM Architecture

Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module

CTHFNet: contrastive translation and hierarchical fusion network for text–video–audio sentiment analysis

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

Multimodal Sentiment Analysis Based on Cross-Modal Attention and Gated Cyclic Hierarchical Fusion Networks

A Multimodal Sentiment Analysis Method Integrating Multi-Layer Attention Interaction and Multi-Feature Enhancement

Context-Dependent Multimodal Sentiment Analysis Based on a Complex Attention Mechanism

GCM-Net: Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network for Video Sentiment and Emotion Analysis

Mutual information maximization and feature space separation and bi-bimodal modality fusion for multimodal sentiment analysis

Cross-modal Enhancement Network for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Based on a Cross-Modal Multihead Attention Mechanism