Abstract: Bridging visual and textual representations plays a central role in multimedia data understanding. The main challenge is that images and texts reside in heterogeneous spaces, which makes it difficult to preserve semantic consistency between the two modalities. To narrow the modality gap, most recent methods resort to extra object detectors or parsers to obtain hierarchical representations. In this work, we address this problem by introducing our Multi-Task Hierarchical Convolutional Neural Network (MT-HCN), which mines hierarchical semantic information without any extra supervision. Firstly, from the perspective of representation architecture, we leverage the intrinsic hierarchical structure of Convolutional Neural Networks (CNNs) to decompose the representations of both modalities into two semantically complementary levels, i.e., exterior representations and concept representations. The former focus on discovering fine-grained, low-level associations between the two modalities, while the latter emphasize capturing more abstract, high-level semantics. Specifically, we present a Self-Supervised Clustering (SSC) loss to preserve more fine-grained semantic clues in the exterior representations. It is built by treating multiple image/text pairs with similar exteriors as one category. In addition, we propose a novel harmonious bidirectional triplet ranking (HBTR) loss, which mitigates the adverse effects of biased and noisy negative samples. Besides the hardest negatives, it also constrains the distance between the positive pairs and the centroid of the negative pairs. Extensive experimental results on two popular cross-modal retrieval benchmarks demonstrate that the proposed MT-HCN achieves competitive results compared with state-of-the-art methods.
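The abstract does not state the HBTR formula, so the following is only a minimal sketch of one plausible reading: a bidirectional hinge ranking loss over the hardest in-batch negative, plus a second hinge term against the centroid of the negatives. The function name `hbtr_loss_sketch` and the parameters `margin` and `centroid_weight` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np


def hbtr_loss_sketch(img, txt, margin=0.2, centroid_weight=1.0):
    """Illustrative bidirectional triplet ranking loss (not the paper's exact form).

    img, txt: (n, d) arrays of L2-normalised embeddings, n >= 2;
    row i of img and row i of txt form a matched (positive) pair.
    """
    sim = img @ txt.T                      # (n, n) pairwise similarities
    n = sim.shape[0]
    pos = np.diag(sim)                     # similarity of each positive pair
    off = ~np.eye(n, dtype=bool)           # True where the pair is a negative

    # Hardest negative in each retrieval direction.
    neg = np.where(off, sim, -np.inf)
    hard_i2t = neg.max(axis=1)             # hardest text for each image
    hard_t2i = neg.max(axis=0)             # hardest image for each text

    # Mean similarity to the negatives. Because the dot product is linear,
    # this equals the similarity to the (unnormalised) centroid of the negatives.
    neg_sum = np.where(off, sim, 0.0)
    cen_i2t = neg_sum.sum(axis=1) / (n - 1)
    cen_t2i = neg_sum.sum(axis=0) / (n - 1)

    def hinge(x):
        return np.maximum(0.0, x)

    hardest = hinge(margin + hard_i2t - pos) + hinge(margin + hard_t2i - pos)
    centroid = hinge(margin + cen_i2t - pos) + hinge(margin + cen_t2i - pos)
    return float(np.mean(hardest + centroid_weight * centroid))


# Usage: perfectly aligned pairs versus deliberately mismatched pairs.
aligned = np.eye(3)
mismatched = np.eye(3)[[1, 2, 0]]
loss_aligned = hbtr_loss_sketch(aligned, aligned)
loss_mismatched = hbtr_loss_sketch(aligned, mismatched)
```

In this sketch the centroid term depends on all negatives in the batch rather than on a single sample, which is one way such a constraint could dampen the influence of an individual noisy negative.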

Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval
