PIXEL: Prompt-based Zero-shot Hashing Via Visual and Textual Semantic Alignment

Zeyu Dong, Qingqing Long, Yihang Zhou, Pengfei Wang, Zhihong Zhu, Xiao Luo, Yidong Wang, Pengyang Wang, Yuanchun Zhou
DOI: https://doi.org/10.1145/3627673.3679747
2024-01-01
Abstract: Zero-Shot Hashing (ZSH) has attracted significant attention for its efficiency and generalizability in multi-modal retrieval scenarios. It aims to encode semantic information into hash codes without requiring labeled training samples from unseen classes. Beyond the commonly used images, which supply visual semantics, and class labels, which supply global semantics, the corresponding attribute descriptions carry critical local semantics with detailed information. However, most existing methods rely only on the extracted numerical attribute values and do not explore the textual semantics in the attribute descriptions. To bridge this gap, we propose Prompt-based zero-shot hashing via vIsual and teXtual sEmantic aLignment, namely PIXEL. Concretely, we design an attribute prompt template based on the attribute descriptions so that the model captures the corresponding local semantics. Then, after obtaining the textual and visual embeddings, we propose an alignment module to model the intra- and inter-class contrastive distances. In addition, an attribute-wise constraint and a class-wise constraint are used to collaboratively learn the hash code, image representation, and visual attributes more effectively. Finally, extensive experimental results demonstrate the superiority of PIXEL.
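As a rough illustration of two ingredients the abstract names, attribute-based prompts and hash codes, the following Python sketch fills a text template with attribute descriptions and ranks items by Hamming distance between binary codes. The template wording, all function names, and the sign-threshold binarization are assumptions made for illustration only; the abstract does not specify PIXEL's actual template, encoders, or losses.

```python
def build_attribute_prompt(class_name, attributes):
    """Fill a hypothetical prompt template with attribute descriptions."""
    return f"a photo of a {class_name}, which has {', '.join(attributes)}."


def binarize(embedding):
    """Sign-threshold a real-valued vector into a +1/-1 hash code.

    This is a common generic binarization, assumed here for illustration.
    """
    return [1 if x >= 0 else -1 for x in embedding]


def hamming_distance(code_a, code_b):
    """Count the positions where two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query_code, database):
    """Return database keys sorted by Hamming distance to the query code."""
    return sorted(database, key=lambda k: hamming_distance(query_code, database[k]))


if __name__ == "__main__":
    # Toy values only; real embeddings would come from trained encoders.
    print(build_attribute_prompt("zebra", ["black stripes", "four legs"]))
    query = binarize([0.3, -1.2, 0.0])
    database = {"a": [1, -1, 1], "b": [-1, 1, -1], "c": [1, 1, 1]}
    print(rank_by_hamming(query, database))
```

Retrieval with binary codes is attractive because comparing two codes only requires counting differing bits, which is the efficiency argument usually made for hashing-based retrieval.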