Abstract: Unsupervised cross-modal hashing (UCMH) has been widely explored to support large-scale cross-modal retrieval of unlabeled data. Despite promising progress, most existing approaches are built on convolutional neural network and multilayer perceptron architectures, which sacrifices the quality of hash codes because of their limited capacity for excavating multi-modal semantics. To pursue better content understanding, we break this convention for UCMH and delve into a transformer-based paradigm. Unlike naïve adaptations via backbone substitution, which overlook the heterogeneous semantics from transformers, we propose a multi-granularity learning framework called hugging to bridge the modality gap. Specifically, we first construct a fine-grained semantic space composed of a series of aggregated local embeddings that capture implicit attribute-level semantics. In the hash learning stage, we incorporate fine-grained alignment with these local embeddings to enhance global hash code alignment. Notably, this fine-grained alignment only facilitates robust cross-modal learning during training and does not complicate global hash code generation at test time, thus fully maintaining the high efficiency of hash-based retrieval. To make the most of fine-grained information, we further propose a differentiable optimized quantization algorithm and extend our framework to hugging^+. This variant integrates quantization learning into the fine-grained alignment during training, producing quantization codes of local embeddings as a by-product at test time, which can augment the retrieval performance through an efficient reranking stage. We instantiate simple baselines with contrastive learning objectives for hugging and hugging^+, namely HuggingHash and HuggingHash^+. Extensive experiments on four text-image retrieval and two text-video retrieval benchmark datasets show the competitive performance of HuggingHash and HuggingHash^+ against state-of-the-art baselines. More encouragingly, we also validate that hugging and hugging^+ are flexible and effective across various baselines, suggesting their universal applicability in the realm of UCMH.
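
The two-stage retrieval described in the abstract (ranking by global hash codes, then an optional reranking stage over a shortlist) can be illustrated with a minimal sketch. This is not the authors' implementation: the sign-based binarization, the function names `sign_hash`, `hamming`, and `retrieve`, and the caller-supplied `fine_score` (a stand-in for whatever score the quantization codes of local embeddings would provide) are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence


def sign_hash(embedding: Sequence[float]) -> int:
    """Binarize a real-valued embedding by sign and pack the bits into an int.

    Assumption: sign thresholding at zero; the paper's hash layer may differ.
    """
    code = 0
    for value in embedding:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed hash codes."""
    return bin(a ^ b).count("1")


def retrieve(
    query_code: int,
    db_codes: Sequence[int],
    shortlist_size: int,
    fine_score: Optional[Callable[[int], float]] = None,
) -> List[int]:
    """Return database indices ranked for the query.

    Stage 1: rank all items by Hamming distance to the query hash code.
    Stage 2 (optional): rerank only the shortlist by a hypothetical
    fine-grained score, where higher means more similar.
    """
    ranked = sorted(
        range(len(db_codes)),
        key=lambda i: hamming(query_code, db_codes[i]),
    )
    shortlist = ranked[:shortlist_size]
    if fine_score is not None:
        shortlist = sorted(shortlist, key=fine_score, reverse=True)
    return shortlist
```

In this sketch the first stage touches only compact binary codes, so its cost stays the same whether or not reranking is used; the second stage applies the more expensive score to just `shortlist_size` candidates.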

Transformer-Based Deep Hashing Method for Multi-Scale Feature Fusion

Deep Hashing with Top Similarity Preserving for Image Retrieval

Multi-scale Fusion Transformer Based Weakly Supervised Hashing Learning for Instance Retrieval

Depth Image Hashing Algorithm Based on Local Global Feature Fusion

Multi-Scale Feature Fusion Based on PVTv2 for Deep Hash Remote Sensing Image Retrieval

HybridHash: Hybrid Convolutional and Self-Attention Deep Hashing for Image Retrieval

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

Cross-modal retrieval based on multi-dimensional feature fusion hashing

Deep CNN based binary hash video representations for face retrieval

DeepHash for Image Instance Retrieval: Getting Regularization, Depth and Fine-Tuning Right

Deep Multi-View Enhancement Hashing for Image Retrieval

Improving visual grounding with multi-scale discrepancy information and centralized-transformer

Enhancing Multi-Label Deep Hashing for Image and Audio With Joint Internal Global Loss Constraints and Large Vision-Language Model

HHF: Hashing-guided Hinge Function for Deep Hashing Retrieval

HDCCT: Hybrid Densely Connected CNN and Transformer for Infrared and Visible Image Fusion

Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification

Self-Supervised Video Hashing Via Bidirectional Transformers

Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers

Deep Semantic-Preserving and Ranking-Based Hashing for Image Retrieval

Large-Scale Multi-Task Image Labeling with Adaptive Relevance Discovery and Feature Hashing

Multi-modal discrete tensor decomposition hashing for efficient multimedia retrieval