Abstract: Bridging visual and textual representations plays a central role in multimedia data understanding. The main challenge is that images and texts reside in heterogeneous spaces, which makes it difficult to preserve semantic consistency between the two modalities. To narrow the modality gap, most recent methods resort to extra object detectors or parsers to obtain hierarchical representations. In this work, we address this problem by introducing our Multi-Task Hierarchical Convolutional Neural Network (MT-HCN), which mines hierarchical semantic information without the aid of any extra supervision. First, from the perspective of representation architecture, we leverage the intrinsic hierarchical structure of Convolutional Neural Networks (CNNs) to decompose the representations of both modalities into two semantically complementary levels, i.e., exterior representations and concept representations. The former focus on discovering fine-grained low-level associations between the two modalities, while the latter emphasize capturing more abstract high-level semantics. Specifically, we present a Self-Supervised Clustering (SSC) loss to preserve more fine-grained semantic clues in the exterior representations. It is built by treating multiple image/text pairs with similar exteriors as one category. In addition, we propose a novel harmonious bidirectional triplet ranking (HBTR) loss, which mitigates the adverse effects of biased and noisy negative samples. Besides the hardest negatives, it also constrains the distance between the positive pairs and the centroid of the negative pairs. Extensive experimental results on two popular cross-modal retrieval benchmarks demonstrate that the proposed MT-HCN achieves competitive results compared with state-of-the-art methods.
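To make the idea behind the HBTR loss more concrete, the following is a minimal sketch of a bidirectional triplet ranking loss that combines a hardest-negative term with a negative-centroid term. It is not the paper's exact formulation, which the abstract does not give. The function name `hbtr_like_loss`, the `margin` and `centroid_weight` parameters, the use of a precomputed similarity matrix, and the approximation of the negative centroid by the mean similarity to all negatives are assumptions made for illustration.

```python
def hbtr_like_loss(sim, margin=0.2, centroid_weight=1.0):
    """Illustrative bidirectional triplet ranking loss (hypothetical sketch).

    sim[i][j] is the similarity between image i and text j; matched
    (positive) pairs lie on the diagonal. Assumption: the "centroid of
    negative pairs" is approximated by the mean similarity to all negatives.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two pairs to form negatives")

    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Image-to-text negatives: other texts for image i (row i).
        row_negs = [sim[i][j] for j in range(n) if j != i]
        # Text-to-image negatives: other images for text i (column i).
        col_negs = [sim[j][i] for j in range(n) if j != i]

        for negs in (row_negs, col_negs):
            hardest = max(negs)               # hardest negative in the batch
            centroid = sum(negs) / len(negs)  # stand-in for the negative centroid
            # Hinge on the hardest negative.
            total += max(0.0, margin + hardest - pos)
            # Hinge on the negative centroid, less sensitive to a single noisy negative.
            total += centroid_weight * max(0.0, margin + centroid - pos)

    return total / n
```

In this sketch the centroid term averages over all negatives, so one mislabeled or atypical negative influences it less than it influences the hardest-negative term. That is one plausible reading of how such a constraint could dampen biased or noisy negatives.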

Step-Wise Hierarchical Alignment Network for Image-Text Matching

HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

Hierarchical Gumbel Attention Network for Text-based Person Search

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Dual Semantic Relationship Attention Network for Image-Text Matching

Hierarchical Feature Aggregation based on Transformer for Image-text Matching

Learning Aligned Image-Text Representations Using Graph Attentive Relational Network

Reference-Aware Adaptive Network for Image-Text Matching

A Mutually Textual and Visual Refinement Network for Image-Text Matching

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Deep Hierarchical Attention Networks for Text Matching in Information Retrieval

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Multi-level network based on transformer encoder for fine-grained image–text matching

Hierarchical visual-semantic interaction for scene text recognition

A New Fine-grained Alignment Method for Image-text Matching

Hierarchical Refined Attention for Scene Text Recognition

Improving Image-Text Matching with Bidirectional Consistency of Cross-Modal Alignment

Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval

Visual-Textual Sentiment Analysis Enhanced by Hierarchical Cross-Modality Interaction