Abstract: Owing to their storage efficiency and fast retrieval, hashing methods have attracted considerable attention for cross-modal retrieval. In contrast to traditional cross-modal hashing, which relies on handcrafted features, deep cross-modal hashing combines the advantages of deep learning and hashing to encode raw multimodal data into compact binary codes that preserve semantic information. Most existing deep cross-modal hashing methods define the semantic similarity between heterogeneous modalities simply by checking for shared semantic labels (for example, two samples are treated as similar if they share at least one label and as dissimilar otherwise), which fails to capture the accurate multi-label semantic relations between heterogeneous data. In this paper, we propose a new Deep Self-supervised Hashing with Fine-grained Similarity Mining (DSH-FSM) framework that efficiently preserves fine-grained multi-label semantic similarity and learns a highly separable embedding space. Specifically, by employing an asymmetric guidance strategy, a novel Semantic-Network is introduced into cross-modal hashing to learn two semantic dictionaries, a semantic feature dictionary and a semantic code dictionary, which guide the Image-Network and the Text-Network to capture multi-label semantic relevance across modalities. Based on the obtained semantic dictionaries, an asymmetric margin-scalable loss is proposed to obtain fine-grained pairwise similarity information, which contributes to producing similarity-preserving and discriminative binary codes. In addition, two feature extractors with transformer encoders are designed to implement the Image-Network and the Text-Network, extracting representative semantic characteristics from raw heterogeneous samples.
Extensive experimental results on various benchmark datasets show that the proposed DSH-FSM framework achieves state-of-the-art cross-modal similarity search performance. Compared with state-of-the-art methods, mAP is improved by 1.9%, 9.1%, and 9.8%, respectively, on three widely used datasets.
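The limitation of the "at least one shared label" rule described in the abstract can be illustrated with a minimal sketch. The function names and the use of cosine similarity between multi-hot label vectors are assumptions made here for illustration only; the abstract does not specify how DSH-FSM computes its fine-grained similarity.

```python
import math


def binary_similarity(labels_a, labels_b):
    """Coarse rule: 1 if the multi-hot label vectors share any label, else 0."""
    return int(any(a and b for a, b in zip(labels_a, labels_b)))


def cosine_label_similarity(labels_a, labels_b):
    """Illustrative fine-grained measure (an assumption, not the paper's
    definition): cosine similarity between multi-hot label vectors."""
    dot = sum(a * b for a, b in zip(labels_a, labels_b))
    norm_a = math.sqrt(sum(a * a for a in labels_a))
    norm_b = math.sqrt(sum(b * b for b in labels_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sample without labels is treated as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical label vectors over four classes.
image_labels = [1, 1, 1, 0]
text_one_shared = [1, 0, 0, 1]   # shares one of the image's three labels
text_all_shared = [1, 1, 1, 0]   # shares all three labels

for text_labels in (text_one_shared, text_all_shared):
    print(binary_similarity(image_labels, text_labels),
          round(cosine_label_similarity(image_labels, text_labels), 3))
```

Under the coarse rule both text samples are equally "similar" to the image, whereas the graded measure separates the partial match from the full match. This is the kind of distinction a fine-grained similarity is meant to preserve.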
