Abstract: In the era of social big data, sentiment analysis is attracting increasing attention for its capacity to understand individuals' attitudes and feelings. Traditional sentiment analysis methods focus on a single modality and become ineffective as enormous amounts of data emerge on social websites in multiple forms. Existing multimodal learning approaches capture the relations between image and text, but they stay at the region level and ignore the fact that image channels are also closely correlated with semantic information. In addition, images on social platforms are closely connected by various types of relations, which are also conducive to sentiment classification but are neglected by most existing works. In this article, we propose an attention-based heterogeneous relational model to improve multimodal sentiment analysis performance by incorporating rich social information. Specifically, we propose a progressive dual attention module to capture the correlations between image and text, and then learn the joint image-text representation from the perspective of content information. A channel attention schema is proposed to highlight semantically rich image channels, and a region attention schema is further designed to highlight the emotional regions based on the attended channels. After that, we construct a heterogeneous relation network and extend the graph convolutional network to aggregate content information from social contexts as a complement for learning high-quality representations of social images. Our proposal is thoroughly evaluated on two benchmark datasets, and experimental results demonstrate the superiority of the proposed model.
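
The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `progressive_dual_attention` and `aggregate_relations`, the projection matrices `W_c` and `W_r`, the dot-product scoring, the concatenation used for the joint representation, and the row-normalized mean aggregation are all assumptions chosen for brevity. The sketch only shows the order of operations: channel attention first, region attention on the attended channels second, then aggregation over several relation types.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def progressive_dual_attention(feat, text, W_c, W_r):
    """Hypothetical dual attention over an image feature map.

    feat: (C, R) image features with C channels and R regions.
    text: (D,) text embedding.
    W_c:  (D, R) assumed projection of the text into region space.
    W_r:  (D, C) assumed projection of the text into channel space.
    """
    # Step 1: channel attention. Score each channel against the text.
    q_c = text @ W_c                       # (R,)
    channel_w = softmax(feat @ q_c)        # (C,), sums to 1
    attended = feat * channel_w[:, None]   # (C, R), channels reweighted

    # Step 2: region attention, computed on the attended channels.
    q_r = text @ W_r                       # (C,)
    region_w = softmax(q_r @ attended)     # (R,), sums to 1
    image_vec = attended @ region_w        # (C,), pooled image vector

    # Joint image-text representation (simple concatenation, an assumption).
    joint = np.concatenate([image_vec, text])
    return joint, channel_w, region_w

def aggregate_relations(H, adjs, weights):
    """Hypothetical GCN-style layer over several relation types.

    H:       (N, F) node features, one row per social image.
    adjs:    list of (N, N) adjacency matrices, one per relation type.
    weights: list of (F, F_out) matrices, one per relation type.
    """
    out = np.zeros((H.shape[0], weights[0].shape[1]))
    for A, W in zip(adjs, weights):
        A_hat = A + np.eye(A.shape[0])            # add self-loops
        deg = A_hat.sum(axis=1, keepdims=True)    # row degrees (>= 1)
        out += (A_hat / deg) @ H @ W              # mean over neighbors
    return np.maximum(out / len(adjs), 0.0)       # average relations, ReLU
```

In this sketch, the joint vectors produced by `progressive_dual_attention` for each image would be stacked into `H` and passed to `aggregate_relations`, so that each image representation is complemented by content from its social neighbors.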

Multimodal Network Embedding Via Attention Based Multi-view Variational Autoencoder

Multimodal Learning of Social Image Representation by Exploiting Social Relations

Learning Social Image Embedding with Deep Multimodal Attention Networks

A Markov Random Field Multi-Modal Variational AutoEncoder

Multimodal Semantic Attention Network for Video Captioning

Deep Attentive Multimodal Network Representation Learning for Social Media Images

Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference

A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information

Temporal Network Embedding for Link Prediction via VAE Joint Attention Mechanism

Social Image Sentiment Analysis by Exploiting Multimodal Content and Heterogeneous Relations

Multimodal Weibull Variational Autoencoder for Jointly Modeling Image-Text Data

Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Attention-Based Modality-Gated Networks for Image-Text Sentiment Analysis

Multi-view visual semantic embedding for cross-modal image–text retrieval

Learning Multimodal Attention LSTM Networks for Video Captioning

Various syncretic co‐attention network for multimodal sentiment analysis

Learning Multimodal VAEs through Mutual Supervision

Intra-view and Inter-view Attention for Multi-view Network Embedding

Image-Text Multimodal Emotion Classification via Multi-View Attentional Network