Abstract: Bridging visual and textual representations plays a central role in multimedia data understanding. The main challenge is that images and texts reside in heterogeneous spaces, which makes it difficult to preserve semantic consistency between the two modalities. To narrow the modality gap, most recent methods resort to extra object detectors or parsers to obtain hierarchical representations. In this work, we address this problem by introducing our Multi-Task Hierarchical Convolutional Neural Network (MT-HCN), which mines hierarchical semantic information without the aid of any extra supervision. First, from the perspective of representation architecture, we leverage the intrinsic hierarchical structure of Convolutional Neural Networks (CNNs) to decompose the representations of both modalities into two semantically complementary levels, i.e., exterior representations and concept representations. The former focus on discovering fine-grained low-level associations between the two modalities, while the latter emphasize capturing more abstract high-level semantics. Specifically, we present a Self-Supervised Clustering (SSC) loss to preserve more fine-grained semantic clues in the exterior representations. It is built by treating multiple image/text pairs with similar exteriors as one category. In addition, we propose a novel harmonious bidirectional triplet ranking (HBTR) loss, which mitigates the adverse effects of biased and noisy negative samples. Besides the hardest negatives, it also imposes constraints on the distance between the positive pairs and the centroid of the negative pairs. Extensive experimental results on two popular cross-modal retrieval benchmarks demonstrate that the proposed MT-HCN achieves competitive results compared with state-of-the-art methods.
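The abstract describes the HBTR loss only in words, so the following is a minimal sketch of one possible reading, not the authors' formulation. It combines a bidirectional hinge on the hardest in-batch negative with a second hinge on the centroid (mean embedding) of all in-batch negatives. The function name `triplet_with_centroid_loss`, the cosine similarity, the `margin` and `centroid_weight` parameters, and the way the two terms are summed are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hinge(x):
    return max(0.0, x)


def triplet_with_centroid_loss(images, texts, margin=0.2, centroid_weight=1.0):
    """Hypothetical bidirectional ranking loss over a batch of matched pairs.

    images[i] and texts[i] form a positive pair; every other item in the
    batch is treated as a negative. `margin` and `centroid_weight` are
    illustrative defaults, not values taken from the paper.
    """
    if len(images) != len(texts) or len(images) < 2:
        raise ValueError("need at least two matched image/text pairs")
    images = [normalize(v) for v in images]
    texts = [normalize(v) for v in texts]
    n = len(images)
    total = 0.0
    for i in range(n):
        pos = dot(images[i], texts[i])
        neg_texts = [texts[j] for j in range(n) if j != i]
        neg_images = [images[j] for j in range(n) if j != i]
        # Hardest negative in each retrieval direction.
        hard_i2t = max(dot(images[i], t) for t in neg_texts)
        hard_t2i = max(dot(texts[i], v) for v in neg_images)
        # Similarity to the centroid of all negatives in each direction.
        cen_i2t = dot(images[i], centroid(neg_texts))
        cen_t2i = dot(texts[i], centroid(neg_images))
        total += hinge(margin + hard_i2t - pos) + hinge(margin + hard_t2i - pos)
        total += centroid_weight * (
            hinge(margin + cen_i2t - pos) + hinge(margin + cen_t2i - pos)
        )
    return total / n
```

In this sketch the centroid term averages over all negatives, so a single noisy negative influences it less than it influences the hardest-negative term. That is one plausible way to read the abstract's claim about biased and noisy negative samples.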

Hierarchical Bi-Directional Conceptual Interaction for Text-Video Retrieval

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval

Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval

Multimodal-enhanced hierarchical attention network for video captioning

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Hierarchical modal interaction balance cross-modal hashing for unsupervised image-text retrieval

Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval

Bidirectional interactive alignment network for image captioning

End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval

Hierarchical visual-semantic interaction for scene text recognition

Context-Aware Hierarchical Transformer for Fine-Grained Video-Text Retrieval

Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks