Abstract: In the era of social big data, sentiment analysis is attracting increasing attention for its capacity to reveal individuals' attitudes and feelings. Traditional sentiment analysis methods focus on a single modality and become ineffective as enormous amounts of data emerge on social websites in multiple forms. Recent multimodal learning approaches capture the relations between image and text, but they operate only at the region level and ignore the fact that image channels are also closely correlated with semantic information. In addition, social images on social platforms are closely connected by various types of relations, which are also conducive to sentiment classification but neglected by most existing works. In this article, we propose an attention-based heterogeneous relational model to improve multimodal sentiment analysis performance by incorporating rich social information. Specifically, we propose a progressive dual attention module to capture the correlations between image and text, and then learn the joint image-text representation from the perspective of content information. A channel attention schema highlights semantically rich image channels, and a region attention schema is further designed to highlight the emotional regions based on the attended channels. After that, we construct a heterogeneous relation network and extend the graph convolutional network to aggregate content information from social contexts as a complement for learning high-quality representations of social images. Our proposal is thoroughly evaluated on two benchmark datasets, and experimental results demonstrate the superiority of the proposed model.
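
The channel-then-region ordering described in the abstract can be illustrated with a minimal NumPy sketch. The abstract gives no equations, so everything specific here is an assumption made for illustration: the function name `progressive_dual_attention`, the bilinear text-guided scoring, the projection matrices `W_c` and `W_r`, and the use of concatenation to form the joint image-text vector. It is not the paper's formulation; it only shows how region attention can be computed on top of channel-reweighted features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def progressive_dual_attention(image_feats, text_vec, W_c, W_r):
    """Text-guided channel attention followed by region attention.

    image_feats: (C, R) feature map with C channels and R spatial regions.
    text_vec:    (D,)   text embedding.
    W_c:         (R, D) hypothetical projection used to score channels.
    W_r:         (C, D) hypothetical projection used to score regions.
    """
    # Step 1: channel attention. Each channel is an R-dim vector that is
    # scored against the projected text embedding.
    channel_scores = image_feats @ (W_c @ text_vec)        # (C,)
    channel_weights = softmax(channel_scores)              # (C,)
    attended = image_feats * channel_weights[:, None]      # (C, R)

    # Step 2: region attention, computed on the channel-attended features
    # so that region scores depend on the channels highlighted in step 1.
    region_scores = attended.T @ (W_r @ text_vec)          # (R,)
    region_weights = softmax(region_scores)                # (R,)
    image_vec = attended @ region_weights                  # (C,)

    # Assumed fusion: plain concatenation of image and text vectors.
    joint = np.concatenate([image_vec, text_vec])          # (C + D,)
    return joint, channel_weights, region_weights


# Toy usage with random inputs (shapes only; values carry no meaning).
rng = np.random.default_rng(0)
C, R, D = 4, 6, 3
joint, cw, rw = progressive_dual_attention(
    rng.normal(size=(C, R)),
    rng.normal(size=D),
    rng.normal(size=(R, D)),
    rng.normal(size=(C, D)),
)
print(joint.shape, cw.shape, rw.shape)
```

In a trained model the projections would be learned and the joint vector would feed a downstream classifier or the relation network; here they are random placeholders.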

Dual-Stream Pre-Training Transformer to Enhance Multimodal Learning for Social Media Prediction

Tri-Modal Transformers with Mixture-of-Modality-Experts for Social Media Prediction

Double-Fine-Tuning Multi-Objective Vision-and-Language Transformer for Social Media Popularity Prediction

Title-and-Tag Contrastive Vision-and-Language Transformer for Social Media Popularity Prediction

A Multimodal Transformer for Live Streaming Highlight Prediction

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

PDT: Pretrained Dual Transformers for Time-aware Bipartite Graphs

Different Data, Different Modalities! Reinforced Data Splitting for Effective Multimodal Information Extraction from Social Media Posts

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Curriculum Learning for Wide Multimedia-Based Transformer with Graph Target Detection

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media

Unimodal Training-Multimodal Prediction: Cross-modal Federated Learning with Hierarchical Aggregation

Improving Social Media Popularity Prediction with Multiple Post Dependencies

TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

A Feature Generalization Framework for Social Media Popularity Prediction

DualTime: A Dual-Adapter Multimodal Language Model for Time Series Representation

Hybrid Deep Sequential Modeling for Social Text-Driven Stock Prediction

Social Image Sentiment Analysis by Exploiting Multimodal Content and Heterogeneous Relations

SS-Trans (Single-Stream Transformer for Multimodal Sentiment Analysis and Emotion Recognition): The Emotion Whisperer—A Single-Stream Transformer for Multimodal Sentiment Analysis