Abstract: Unsupervised cross-modal hashing (UCMH) has been widely explored to support large-scale cross-modal retrieval over unlabeled data. Despite promising progress, most existing approaches are built on convolutional neural network and multilayer perceptron architectures, whose limited capacity for mining multi-modal semantics degrades the quality of the hash codes. To pursue better content understanding, we break with this convention for UCMH and explore a transformer-based paradigm. Unlike naïve adaptations that merely substitute the backbone and overlook the heterogeneous semantics produced by transformers, we propose a multi-granularity learning framework called hugging to bridge the modality gap. Specifically, we first construct a fine-grained semantic space composed of a series of aggregated local embeddings that capture implicit attribute-level semantics. In the hash learning stage, we incorporate fine-grained alignment with these local embeddings to enhance global hash code alignment. Notably, this fine-grained alignment is used only during training to make cross-modal learning more robust; it does not complicate global hash code generation at test time, so the high efficiency of hash-based retrieval is fully maintained. To make the most of fine-grained information, we further propose a differentiable optimized quantization algorithm and extend our framework to hugging^+. This variant integrates quantization learning into the fine-grained alignment during training and yields quantization codes of the local embeddings as a by-product at test time, which can improve retrieval performance through an efficient reranking stage. We instantiate simple baselines with contrastive learning objectives for hugging and hugging^+, namely HuggingHash and HuggingHash^+. Extensive experiments on four text-image retrieval and two text-video retrieval benchmark datasets show that HuggingHash and HuggingHash^+ perform competitively against state-of-the-art baselines. More encouragingly, we also validate that hugging and hugging^+ are flexible and effective across various baselines, suggesting their broad applicability to UCMH.
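The two-stage retrieval idea mentioned above (a fast ranking on global hash codes, followed by reranking with quantized local embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy codebook, the max-over-codewords scoring rule, and all data below are assumptions chosen only to show the general pattern of Hamming ranking plus reranking.

```python
from typing import List, Sequence

Vector = Sequence[float]


def hamming(a: int, b: int) -> int:
    """Hamming distance between two hash codes packed into Python ints."""
    return bin(a ^ b).count("1")


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def coarse_rank(query_code: int, db_codes: List[int], top_k: int) -> List[int]:
    """Stage 1: rank database items by Hamming distance and keep the top_k."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: (hamming(query_code, db_codes[i]), i))
    return order[:top_k]


def fine_score(query_locals: List[Vector], item_codes: List[int],
               codebook: List[Vector]) -> float:
    """Assumed scoring rule: for each query local embedding, take the best
    inner product with the codewords that represent the item's local embeddings."""
    return sum(max(dot(q, codebook[c]) for c in item_codes)
               for q in query_locals)


def retrieve(query_code: int, query_locals: List[Vector], db_codes: List[int],
             db_local_codes: List[List[int]], codebook: List[Vector],
             top_k: int) -> List[int]:
    """Stage 2: rerank the Hamming shortlist by the fine-grained score."""
    candidates = coarse_rank(query_code, db_codes, top_k)
    return sorted(candidates,
                  key=lambda i: -fine_score(query_locals, db_local_codes[i], codebook))


# Toy data (hypothetical): 4-bit global codes and a 3-entry codebook.
codebook = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
db_codes = [0b1010, 0b1011, 0b0101, 0b1000]
db_local_codes = [[2], [0, 1], [1], [0]]
query_code = 0b1010
query_locals = [(1.0, 0.0), (0.0, 1.0)]

shortlist = coarse_rank(query_code, db_codes, top_k=3)
ranking = retrieve(query_code, query_locals, db_codes,
                   db_local_codes, codebook, top_k=3)
print(shortlist, ranking)
```

In this toy example the item with the exact hash match is shortlisted first but drops to last after reranking, because its quantized local embeddings match the query's local embeddings poorly. Only the shortlist is rescored, so the extra cost stays small relative to the database size.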

Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers
