Abstract:<p>Social media has become indispensable to people's lives, letting them share their views and emotions through images and text. Analyzing social images for sentiment prediction can help in understanding human social behavior and in providing better recommendation results. Most current research on image sentiment analysis has made good progress but ignores the semantic correlation between an image and its corresponding descriptive sentence (caption). To capture the complementary multimodal information for joint sentiment classification, in this paper we propose a novel cross-modal Semantic Content Correlation (SCC) method based on deep matching and hierarchical networks, which bridges the correlation between images and captions. Specifically, pre-trained convolutional neural networks (CNNs) are leveraged to encode the content of visual sub-regions, and GloVe is employed to embed the textual semantics. Relying on the visual content and textual semantics, a joint attention network is proposed to learn the content correlation between the image and its caption, which is then exported as an image-text pair. To effectively exploit the dependence of the visual content on the textual semantics of the caption, the caption is processed by a Class-Aware Sentence Representation (CASR) network with a class dictionary, and a fully connected layer concatenates the outputs of CASR into a class-aware vector. Finally, the class-aware distributed vector is fed into an Inner-class Dependency Long Short-Term Memory network (IDLSTM), with the image-text pair as a query, to further capture the cross-modal non-linear correlations for sentiment prediction. Extensive experiments conducted on three datasets verify the effectiveness of the SCC model.</p>
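
The abstract does not specify the joint attention network, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes plain dot-product attention, assumes region features and word embeddings have already been projected to a shared dimension, and uses hypothetical names (`cross_modal_attention`, `image_text_pair`) and toy vectors in place of real CNN and GloVe outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_modal_attention(regions, words):
    """For each word vector, attend over all region vectors.

    Assumption: dot-product scoring; the paper's scoring function is not
    given in the abstract.
    Returns (weights, attended): one weight row and one attended visual
    vector per word.
    """
    dim = len(regions[0])
    weights, attended = [], []
    for w in words:
        alpha = softmax([dot(w, r) for r in regions])
        weights.append(alpha)
        attended.append(
            [sum(a * r[k] for a, r in zip(alpha, regions)) for k in range(dim)]
        )
    return weights, attended


def image_text_pair(regions, words):
    """Mean-pool attended visual vectors and word vectors, then concatenate.

    Assumption: mean pooling and concatenation stand in for whatever
    fusion the SCC model actually uses.
    """
    _, attended = cross_modal_attention(regions, words)
    n, dim = len(words), len(words[0])
    visual = [sum(v[k] for v in attended) / n for k in range(dim)]
    textual = [sum(w[k] for w in words) / n for k in range(dim)]
    return visual + textual


# Toy stand-ins for CNN sub-region features and GloVe word embeddings.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0]]

weights, attended = cross_modal_attention(regions, words)
pair = image_text_pair(regions, words)
```

In this toy setup each word places more weight on the regions it aligns with, and `pair` is the fused vector that a downstream classifier (the IDLSTM in the abstract) would receive as a query.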

Neural Visual Social Comment on Image-Text Content

Visualizing and Understanding Neural Models in NLP

Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification

Visual-Textual Sentiment Analysis Enhanced by Hierarchical Cross-Modality Interaction

Cross-Modal Commentator: Automatic Machine Commenting Based on Cross-Modal Information

VCMaster: Generating Diverse and Fluent Live Video Comments Based on Multimodal Contexts

An Attempt towards Interpretable Audio-Visual Video Captioning

Multimodality-guided Visual-Caption Semantic Enhancement

Netizen-Style Commenting on Fashion Photos: Dataset and Diversity Measures

Share-and-Chat: Achieving Human-Level Video Commenting by Search and Multi-View Embedding

Hybrid context enriched deep learning model for fine-grained sentiment analysis in textual and visual semiotic modality social data

Comment-Guided Semantics-Aware Image Aesthetics Assessment

Emotional Video Captioning With Vision-Based Emotion Interpretation Network

Neuraltalk+: neural image captioning with visual assistance capabilities

Cross-modal image sentiment analysis via deep correlation of textual semantic

Unsupervised Machine Commenting with Neural Variational Topic Model

Predicting Viewer Affective Comments Based on Image Content in Social Media

Towards Usable Neural Comment Generation Via Code-Comment Linkage Interpretation: Method and Empirical Study

Various syncretic co‐attention network for multimodal sentiment analysis

Social Image Sentiment Analysis by Exploiting Multimodal Content and Heterogeneous Relations

Multimodal Neural Machine Translation with Search Engine Based Image Retrieval