Abstract: Owing to their storage efficiency and retrieval speed, hashing methods have attracted considerable attention for cross-modal retrieval. In contrast to traditional cross-modal hashing, which relies on handcrafted features, deep cross-modal hashing combines the advantages of deep learning and hashing to encode raw multimodal data into compact binary codes that preserve semantic information. Most existing deep cross-modal hashing methods define the semantic similarity between heterogeneous modalities simply by counting the number of shared semantic labels (for example, two samples are similar if they share at least one label and dissimilar otherwise), which fails to capture the precise multi-label semantic relations between heterogeneous data. In this paper, we propose a new Deep Self-supervised Hashing with Fine-grained Similarity Mining (DSH-FSM) framework that efficiently preserves fine-grained multi-label semantic similarity and learns a highly separable embedding space. Specifically, using an asymmetric guidance strategy, a novel Semantic-Network is introduced into cross-modal hashing to learn two semantic dictionaries, a semantic feature dictionary and a semantic code dictionary, which guide the Image-Network and the Text-Network to capture multi-label semantic relevance across modalities. Based on the learned semantic dictionaries, an asymmetric margin-scalable loss is proposed to exploit fine-grained pair-wise similarity information, which contributes to similarity-preserving and discriminative binary codes. In addition, two feature extractors with transformer encoders are designed to implement the Image-Network and the Text-Network, extracting representative semantic features from raw heterogeneous samples.
Extensive experiments on several benchmark datasets show that the proposed DSH-FSM framework achieves state-of-the-art cross-modal similarity search performance. Compared with the state-of-the-art methods, mAP improves by 1.9%, 9.1%, and 9.8%, respectively, on three widely used datasets.
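
The contrast the abstract draws between coarse and fine-grained label similarity can be illustrated with a small sketch. The binary rule below follows the "share at least one label" definition quoted above. The graded measure is Jaccard overlap, used here only as an assumed stand-in, since the abstract does not state which fine-grained measure DSH-FSM uses. The function names and label vectors are hypothetical.

```python
def shares_label(a, b):
    """Coarse similarity: 1 if two multi-hot label vectors share any label, else 0."""
    return int(any(x and y for x, y in zip(a, b)))


def jaccard(a, b):
    """Graded similarity (assumed stand-in, not the paper's measure):
    number of shared labels divided by number of labels present in either sample."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return shared / either if either else 0.0


# Hypothetical multi-hot label vectors over four labels.
image_labels = [1, 1, 1, 0]  # image tagged with labels 0, 1, 2
text_a = [1, 1, 1, 0]        # text with the same three labels
text_b = [1, 0, 0, 1]        # text sharing only label 0
text_c = [0, 0, 0, 1]        # text sharing no label

for name, text in [("text_a", text_a), ("text_b", text_b), ("text_c", text_c)]:
    print(name, shares_label(image_labels, text), jaccard(image_labels, text))
```

Under the binary rule, text_a and text_b are indistinguishable (both count as similar to the image), whereas the graded measure separates a full label match from a single shared label. This loss of information is what the abstract means by failing to represent multi-label relations.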

Algorithm Research of ELMo Word Embedding and Deep Learning Multimodal Transformer in Image Description

Domain Adaptation Meets Zero-Shot Learning: an Annotation-Efficient Approach to Multi-Modality Medical Image Segmentation

Manifold Regularized Cross-Modal Embedding for Zero-Shot Learning

Research on Optimization of Natural Language Processing Model Based on Multimodal Deep Learning

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval

Zero-Shot Image Tagging By Hierarchical Semantic Embedding

Deep Image Annotation and Classification by Fusing Multi-Modal Semantic Topics

Deep Semisupervised Zero-Shot Learning with Maximum Mean Discrepancy

Transductive Zero-Shot Learning with a Self-Training Dictionary Approach

Deep Semantic-Visual Alignment for Zero-Shot Remote Sensing Image Scene Classification

Towards Effective Deep Embedding for Zero-Shot Learning

Deep Transfer Learning For Modality Classification Of Medical Images

Deep Visual Semantic Embedding with Text Data Augmentation and Word Embedding Initialization

Transductive Zero-Shot Action Recognition by Word-Vector Embedding

Deep Self-Supervised Hashing With Fine-Grained Similarity Mining for Cross-Modal Retrieval

Transductive Multi-label Zero-shot Learning

Multi-modal Remote Sensing Image Description Based on Word Embedding and Self-Attention Mechanism

Deep Multi-Similarity Hashing Via Label-Guided Network for Cross-Modal Retrieval

Cross-Modality Bridging and Knowledge Transferring for Image Understanding