Abstract: Interest in multimodal sentiment analysis and emotion recognition has grown in recent years because of their wide range of practical applications. Multiple modalities allow complementary information to be integrated, improving the accuracy and precision of sentiment and emotion recognition tasks. However, working with multiple modalities presents several challenges, including handling data source heterogeneity, fusing information, aligning and synchronizing modalities, and designing effective feature extraction techniques that capture discriminative information from each modality. This paper introduces a novel framework called "Attention-based Multimodal Sentiment Analysis and Emotion Recognition (AMSAER)" to address these challenges. The framework leverages intra-modality discriminative features and inter-modality correlations in the visual, audio, and textual modalities. It incorporates an attention mechanism that emphasizes the task-relevant aspects of the visual, textual, and acoustic inputs to facilitate sentiment and emotion classification. The proposed approach employs a separate model for each modality to automatically extract discriminative semantic words, image regions, and audio features. A deep hierarchical model is then developed, incorporating intermediate fusion to learn hierarchical correlations between the modalities at the bimodal and trimodal levels. Finally, the framework combines four distinct models through decision-level fusion to enable multimodal sentiment analysis and emotion recognition. The effectiveness of the proposed framework is demonstrated through extensive experiments on the publicly available Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. The results confirm a notable performance improvement over state-of-the-art methods, attaining 85% and 93% accuracy for sentiment analysis and emotion classification, respectively.
Additionally, when considering class-wise accuracy, the results indicate that the "angry" emotion and "positive" sentiment are classified more effectively than the other emotions and sentiments, achieving 96.80% and 93.14% accuracy, respectively.
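
The decision-level fusion step mentioned above can be illustrated with a minimal sketch. The abstract does not state which fusion rule is used, so the weighted averaging of class probabilities (soft voting) below is an assumption, not the authors' method. The function names, the class labels, and the probability values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def decision_level_fusion(
    model_probs: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
) -> Dict[str, float]:
    """Fuse per-model class probabilities by weighted averaging (soft voting).

    Assumption: every model outputs a probability for the same set of labels.
    """
    if not model_probs:
        raise ValueError("at least one model output is required")
    if weights is None:
        # Equal weights by default (hypothetical choice).
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs):
        raise ValueError("one weight per model is required")

    total = sum(weights)
    labels = model_probs[0].keys()
    return {
        label: sum(w * probs[label] for w, probs in zip(weights, model_probs)) / total
        for label in labels
    }


def predict(fused: Dict[str, float]) -> str:
    """Return the label with the highest fused probability."""
    return max(fused, key=fused.get)


# Illustrative outputs from four hypothetical models (made-up numbers).
outputs = [
    {"angry": 0.6, "happy": 0.3, "neutral": 0.1},
    {"angry": 0.5, "happy": 0.2, "neutral": 0.3},
    {"angry": 0.2, "happy": 0.5, "neutral": 0.3},
    {"angry": 0.7, "happy": 0.2, "neutral": 0.1},
]
fused = decision_level_fusion(outputs)
print(predict(fused))  # prints "angry"
```

With equal weights this reduces to a plain average of the four probability vectors; unequal weights would let a more reliable model contribute more to the final decision.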
