Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers

Jinpeng Wang,Ziyun Zeng,Bin Chen,Yuting Wang,Dongliang Liao,Gongfu Li,Yiru Wang,Shu-Tao Xia
DOI: https://doi.org/10.1007/s11263-024-02009-7
IF: 13.369
2024-01-01
International Journal of Computer Vision
Abstract:Unsupervised cross-modal hashing (UCMH) has been commonly explored to support large-scale cross-modal retrieval of unlabeled data. Despite promising progress, most existing approaches are developed on convolutional neural network and multilayer perceptron architectures, sacrificing the quality of hash codes due to limited capacity for excavating multi-modal semantics. To pursue better content understanding, we break this convention for UCMH and delve into a transformer-based paradigm. Unlike naïve adaptations via backbone substitution that overlook the heterogeneous semantics from transformers, we propose a multi-granularity learning framework called hugging to bridge the modality gap. Specifically, we first construct a fine-grained semantic space composed of a series of aggregated local embeddings that capture implicit attribute-level semantics. In the hash learning stage, we innovatively incorporate fine-grained alignment with these local embeddings to enhance global hash code alignment. Notably, this fine-grained alignment only facilitates robust cross-modal learning without complicating global hash code generation at test time, thus fully maintaining the high efficiency of hash-based retrieval. To make the most of fine-grained information, we further propose a differentiable optimized quantization algorithm and extend our framework to hugging ^+ . This variant neatly integrates quantization learning into the fine-grained alignment during training, producing quantization codes of local embeddings as a gift at test time, which can augment the retrieval performance through an efficient reranking stage. We instantiate simple baselines with contrastive learning objectives for hugging and hugging ^+ , namely HuggingHash and HuggingHash ^+ . Extensive experiments on 4 text-image retrieval and 2 text-video retrieval benchmark datasets show the competitive performance of HuggingHash and HuggingHash ^+ against state-of-the-art baselines. More encouragingly, we also validate that hugging and hugging ^+ are flexible and effective across various baselines, suggesting their universal applicability in the realm of UCMH.
What problem does this paper attempt to address?