Target-Dependent Multimodal Sentiment Analysis Via Employing Visual-to Emotional-Caption Translation Network using Visual-Caption Pairs

Ananya Pandey, Dinesh Kumar Vishwakarma

2024-08-05

Abstract: Multimodal sentiment recognition has attracted a notable surge of interest in the natural language processing and multimedia fields. This study addresses Target-Dependent Multimodal Sentiment Analysis (TDMSA), which aims to identify the sentiment associated with every target (aspect) mentioned in a multimodal post consisting of a visual-caption pair. Despite recent advances in multimodal sentiment recognition, emotional clues from the visual modality, specifically those conveyed by facial expressions, have not been explicitly incorporated. The challenge is to obtain visual and emotional clues reliably and then synchronise them with the textual content. To this end, this study presents a novel approach called the Visual-to-Emotional-Caption Translation Network (VECTN). Its primary objective is to acquire visual sentiment clues by analysing facial expressions, and to align and fuse those emotional clues with the target attribute of the caption modality. Experiments on two publicly accessible multimodal Twitter datasets, Twitter-2015 and Twitter-2017, show that the proposed model achieves an accuracy of 81.23% and a macro-F1 of 80.61% on Twitter-2015, and 77.42% and 75.19%, respectively, on Twitter-2017. The observed improvement indicates that our model outperforms other approaches at capturing target-level sentiment in multimodal data using facial expressions.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses Target-Dependent Multimodal Sentiment Analysis (TDMSA): determining the sentiment expressed toward each target in a social media post that pairs an image with a text caption. The core challenge is to obtain emotional cues from the visual modality and align them with the textual content. The paper proposes a novel method, the Visual-to-Emotional-Caption Translation Network (VECTN), which extracts emotional cues from facial expressions and then aligns and fuses them with the target entities in the text modality. In this way, the model can make better use of the emotional information in images and combine it with the caption for sentiment classification. On two public multimodal Twitter datasets, the model reports an accuracy of 81.23% and a macro-F1 of 80.61% on Twitter-2015, and an accuracy of 77.42% and a macro-F1 of 75.19% on Twitter-2017. According to the authors, this indicates that the model outperforms other methods at capturing target-level sentiment using facial expressions.
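For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and macro-F1 are conventionally computed for target-level sentiment labels. It is not the authors' evaluation code: the three-class label set, the function names, and the toy predictions are all assumptions made for illustration.

```python
# Illustrative only: standard definitions of accuracy and macro-F1.
# The label set and the toy data below are assumptions, not taken from the paper.

def accuracy(y_true, y_pred):
    """Fraction of targets whose predicted sentiment matches the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# Hypothetical gold labels and predictions, one entry per (post, target) pair.
LABELS = ["negative", "neutral", "positive"]
gold = ["positive", "neutral", "negative", "positive", "neutral", "negative"]
pred = ["positive", "neutral", "neutral", "positive", "negative", "negative"]

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred, LABELS)
print(f"accuracy={acc:.4f}  macro-F1={f1:.4f}")
```

Because macro-F1 averages the per-class scores without weighting by class frequency, it penalises a model that does well only on the most common sentiment class, which is presumably why it is reported alongside accuracy.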
