Hierarchical modal interaction balance cross-modal hashing for unsupervised image-text retrieval

Jie Zhang, Ziyong Lin, Xiaolong Jiang, Mingyong Li, Chao Wang
DOI: https://doi.org/10.1007/s11042-024-19371-w
IF: 2.577
2024-05-19
Multimedia Tools and Applications
Abstract: As multimedia technologies advance, processing untagged image-text data has become central to cross-modal retrieval. However, current methods often neglect three critical issues when learning hash codes: (1) incomplete feature representation limits the capture of diverse latent semantics; (2) binary codes learned from quantization loss lack overall constraints and global interaction; (3) prioritizing retrieval performance overlooks modality robustness, leading to significant multi-modal retrieval disparities. To address these challenges, we introduce HMIB, an unsupervised cross-modal hashing algorithm. We build deep feature encoders on pre-trained models such as CLIP and VGG to capture latent semantic associations across natural language and image classification. A hierarchical interactive modal similarity generator introduces comprehensive process constraints and corrects ambiguous edge semantic data, enhancing robustness and generating high-quality hash codes. Extensive experiments on three widely used datasets show that HMIB maintains high-level performance while minimizing cross-modal retrieval disparities.
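For readers unfamiliar with the hashing terms used in the abstract, the following is a minimal, generic sketch of sign-based binarization, quantization loss, and Hamming-distance ranking. It is not the HMIB method: the function names and the 4-bit feature values are hypothetical and chosen only for illustration, standing in for the outputs of image and text encoders.

```python
# Generic illustration of hash-based cross-modal retrieval.
# This is NOT the HMIB algorithm; all names and numbers are hypothetical.

def binarize(features):
    """Map real-valued features to a {-1, +1} hash code via the sign function."""
    return [1 if x >= 0 else -1 for x in features]

def quantization_loss(features, code):
    """Squared error between continuous features and their binary code."""
    return sum((x - b) ** 2 for x, b in zip(features, code))

def hamming_distance(code_a, code_b):
    """Count the bit positions in which two hash codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

def retrieve(query_code, database_codes):
    """Return database indices sorted by Hamming distance to the query."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )

# Made-up 4-dimensional encoder outputs for one image and three texts.
image_features = [0.9, -0.2, 0.4, -0.7]
text_features = [
    [0.8, -0.1, 0.3, -0.9],
    [-0.6, 0.5, -0.4, 0.2],
    [0.7, 0.3, 0.1, -0.5],
]

image_code = binarize(image_features)
text_codes = [binarize(t) for t in text_features]

# Image-to-text retrieval: rank the texts by closeness to the image code.
ranking = retrieve(image_code, text_codes)
loss = quantization_loss(image_features, image_code)
```

Running the same ranking in the opposite direction (text query, image database) and comparing the two accuracies is one simple way to observe the kind of cross-modal retrieval disparity the abstract refers to.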
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
What problem does this paper attempt to address?