Abstract:<p>Social media has become indispensable to people's lives, giving them a place to share their views and emotions through images and text. Analyzing social images for sentiment prediction can help in understanding human social behavior and in providing better recommendation results. Most current research on image sentiment analysis has made good progress but ignores the semantic correlation between an image and its corresponding descriptive sentence (caption). To capture the complementary multimodal information for joint sentiment classification, we propose a novel cross-modal Semantic Content Correlation (SCC) method based on deep matching and hierarchical networks, which bridges the correlation between images and captions. Specifically, pre-trained convolutional neural networks (CNNs) are leveraged to encode the contents of visual sub-regions, and GloVe is employed to embed the textual semantics. Relying on the visual contents and textual semantics, a joint attention network is proposed to learn the content correlation between the image and its caption, which is then output as an image-text pair. To effectively exploit the dependence of the visual contents on the textual semantics of the caption, the caption is processed by a Class-Aware Sentence Representation (CASR) network with a class dictionary, and a fully connected layer concatenates the outputs of CASR into a class-aware vector. Finally, the class-aware distributed vector is fed into an Inner-class Dependency Long Short-Term Memory network (IDLSTM), with the image-text pair as a query, to further capture the cross-modal non-linear correlations for sentiment prediction. Extensive experiments conducted on three datasets verify the effectiveness of SCC.</p>
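
The joint attention step described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes a simple scaled dot-product co-attention between region features and word embeddings, uses random arrays in place of real CNN and GloVe outputs, and the function name `joint_attention`, the feature dimension (64), the region count (49), and the caption length (12) are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(regions, words):
    """Fuse image regions and caption words into one image-text pair vector.

    regions: (R, d) array, stand-in for CNN sub-region features (assumption).
    words:   (T, d) array, stand-in for GloVe word embeddings (assumption).
    Returns the concatenated (2d,) pair vector plus both attention weightings.
    """
    d = regions.shape[1]
    # Affinity between every region and every word, scaled by sqrt(d).
    affinity = regions @ words.T / np.sqrt(d)            # (R, T)
    # A region is weighted by its best-matching word, and vice versa.
    region_weights = softmax(affinity.max(axis=1))       # (R,)
    word_weights = softmax(affinity.max(axis=0))         # (T,)
    attended_image = region_weights @ regions            # (d,)
    attended_text = word_weights @ words                 # (d,)
    pair = np.concatenate([attended_image, attended_text])
    return pair, region_weights, word_weights

# Random placeholders instead of real encoder outputs.
rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 64))   # e.g. a 7x7 grid of region features
words = rng.normal(size=(12, 64))     # e.g. a 12-token caption
pair, region_weights, word_weights = joint_attention(regions, words)
print(pair.shape)
```

In a pipeline like the one the abstract outlines, a vector such as `pair` would play the role of the query passed to the later recurrent stage; how SCC actually builds and uses that query is not specified here.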

Image Tagging Via Cross-Modal Semantic Mapping

Learning Visually Aligned Semantic Graph for Cross-Modal Manifold Matching

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

Bridging the Semantic Gap Between Image Contents and Tags

Target-oriented Sentiment Classification with Sequential Cross-modal Semantic Graph

Cross-Modality Bridging and Knowledge Transferring for Image Understanding

Cross-Modal Image-Text Retrieval with Semantic Consistency

Learning to Tag

Towards Semantic Embedding In Visual Vocabulary

Semantic Tag Augmented XlanV Model for Video Captioning

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Towards Multi-Semantic Image Annotation with Graph Regularized Exclusive Group Lasso

Exploiting Multi-Context Analysis in Semantic Image Classification

Cross-modal image sentiment analysis via deep correlation of textual semantic

Transfer Tagging from Image to Video

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding

TAG: Guidance-free Open-Vocabulary Semantic Segmentation

Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

Image Tag Recommendation via Deep Cross-Modal Correlation Mining

Cross-modal Semantic Interference Suppression for image-text matching

Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning