Abstract: Interest in multimodal sentiment analysis and emotion recognition has grown in recent years because of their wide range of practical applications. Multiple modalities allow complementary information to be integrated, improving the accuracy and precision of sentiment and emotion recognition tasks. However, working with multiple modalities presents several challenges, including handling data source heterogeneity, fusing information, aligning and synchronizing modalities, and designing effective feature extraction techniques that capture discriminative information from each modality. This paper introduces a novel framework called "Attention-based Multimodal Sentiment Analysis and Emotion Recognition (AMSAER)" to address these challenges. The framework leverages intra-modality discriminative features and inter-modality correlations in the visual, audio, and textual modalities. It incorporates an attention mechanism that emphasizes the task-relevant aspects of the visual, textual, and acoustic inputs to support sentiment and emotion classification. The proposed approach employs a separate model for each modality to automatically extract discriminative semantic words, image regions, and audio features. A deep hierarchical model is then developed, incorporating intermediate fusion to learn hierarchical correlations between the modalities at the bimodal and trimodal levels. Finally, the framework combines four distinct models through decision-level fusion to enable multimodal sentiment analysis and emotion recognition. The effectiveness of the proposed framework is demonstrated through extensive experiments on the publicly available Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. The results confirm a notable performance improvement over state-of-the-art methods, attaining 85% and 93% accuracy for sentiment analysis and emotion classification, respectively. In terms of class-wise accuracy, the "angry" emotion and the "positive" sentiment are classified more effectively than the other emotions and sentiments, achieving 96.80% and 93.14% accuracy, respectively.
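The decision-level fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the code below assumes weighted soft voting (a weighted average of each model's class probabilities). The label set, the model names (three unimodal models plus one hierarchical fusion model), the function names, and the probability values are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical label set; the abstract does not list the classes used.
LABELS = ["angry", "happy", "sad", "neutral"]


def fuse_decisions(
    model_probs: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted soft voting: average per-model class probabilities.

    This fusion rule is an assumption, not the paper's stated method.
    """
    if not model_probs:
        raise ValueError("at least one model output is required")
    names = list(model_probs)
    n_classes = len(model_probs[names[0]])
    if weights is None:
        # Default to equal weights for every model.
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = [0.0] * n_classes
    for name in names:
        probs = model_probs[name]
        if len(probs) != n_classes:
            raise ValueError(f"model {name!r} has a different number of classes")
        for i, p in enumerate(probs):
            fused[i] += weights[name] * p / total_weight
    return fused


def predict(fused: List[float]) -> str:
    """Return the label with the highest fused probability."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return LABELS[best]


# Made-up probabilities for one utterance, one row per model.
example = {
    "text": [0.60, 0.20, 0.10, 0.10],
    "audio": [0.30, 0.40, 0.20, 0.10],
    "visual": [0.50, 0.10, 0.30, 0.10],
    "hierarchical_fusion": [0.70, 0.10, 0.10, 0.10],
}
fused = fuse_decisions(example)
label = predict(fused)
```

In this made-up example the audio model alone would pick "happy", but the fused distribution favours "angry" because the other three models agree. Passing a `weights` dictionary would let a more reliable model count for more; how the paper weights its four models, if at all, is not stated in the abstract.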

Investigating Multisensory Integration in Emotion Recognition Through Bio-Inspired Computational Models

Synch-Graph: Multisensory Emotion Recognition Through Neural Synchrony Via Graph Convolutional Networks

CMCI: A Robust Multimodal Fusion Method for Spiking Neural Networks

Multimodal Emotion Recognition Based on Feature Fusion

Multimodal modelling of human emotion using sound, image and text fusion

Multimodal Emotional Classification Based on Meaningful Learning

Multimodal Emotion Recognition by Combining Physiological Signals and Facial Expressions: a Preliminary Study

Speech Emotion Recognition with Early Visual Cross-modal Enhancement Using Spiking Neural Networks

Research on cross-modal emotion recognition based on multi-layer semantic fusion

Attention-based multimodal sentiment analysis and emotion recognition using deep neural networks

Multimodal emotion recognition model via hybrid model with improved feature level fusion on facial and EEG feature set

Emotion recognition based on brain-like multimodal hierarchical perception

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

A multimodal emotion recognition model integrating speech, video and MoCAP

E-MFNN: an emotion-multimodal fusion neural network framework for emotion recognition

Multimodal Emotion Recognition Using Different Fusion Techniques

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information

Multi-modal fusion network with complementarity and importance for emotion recognition

Emotion recognition using multimodal deep learning in multiple psychophysiological signals and video

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals