CLIP Multi-modal Hashing for Multimedia Retrieval

Jian Zhu, Mingkai Sheng, Zhangmin Huang, Jingfei Chang, Jinling Jiang, Jian Long, Cheng Luo, Lei Liu
2024-10-10
Abstract: Multi-modal hashing methods, which fuse multi-source data to generate binary hash codes, are widely used in multimedia retrieval. However, the individual backbone networks have limited feature expression capabilities and are not jointly pre-trained on large-scale unsupervised multi-modal data, resulting in low retrieval accuracy. To address this issue, we propose a novel CLIP Multi-modal Hashing (CLIPMH) method. Our method employs the CLIP framework to extract both text and vision features and then fuses them to generate hash codes. Because the features of each modality are enhanced, our method greatly improves the retrieval performance of multi-modal hashing. Compared with state-of-the-art unsupervised and supervised multi-modal hashing methods, experiments show that the proposed CLIPMH significantly improves performance (a maximum increase of 8.38% in mAP).
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to address the poor performance of multimodal hashing methods in multimedia retrieval. Specifically, existing multimodal hashing methods suffer from limited feature representation capabilities of backbone networks and lack of joint pre-training on large-scale unsupervised multimodal data, resulting in low retrieval accuracy. To solve this problem, the authors propose a new multimodal hashing method based on the CLIP framework (CLIPMH), which significantly improves the retrieval performance of multimodal hashing methods by extracting and fusing textual and visual features to generate hash codes.

### Main Contributions:

1. **Proposing the CLIPMH Method**: For the first time, a large-scale multimodal model (such as CLIP) is applied to multimodal hashing retrieval, addressing the issue of insufficient feature representation in existing methods.
2. **Utilizing the CLIP Framework**: High-quality visual and textual features are extracted through the CLIP framework, and hash codes are generated through a multimodal fusion module, significantly enhancing retrieval performance.
3. **Experimental Validation**: Experimental results on multiple benchmark datasets (such as MIR-Flickr25K, NUS-WIDE, and MS COCO) show that CLIPMH achieves significant performance improvements over existing methods, with a maximum increase of 8.38% in mAP.

### Problems Addressed:

- **Insufficient Feature Representation**: The backbone networks of existing methods lack good feature representation capabilities, leading to low retrieval accuracy.
- **Lack of Joint Pre-training**: The backbone networks of existing methods are trained on individual modalities without joint pre-training, resulting in insufficient semantic alignment.
- **Semantic Gap**: Existing methods fail to effectively address the semantic gap between different modalities, affecting overall retrieval performance.

By introducing the CLIP framework, the CLIPMH method achieves significant progress in feature extraction and semantic alignment, thereby greatly enhancing the retrieval performance of multimodal hashing methods.
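
To make the pipeline summarized above concrete, here is a minimal NumPy sketch of its three generic steps: fusing per-modality features into a binary code, ranking a database by Hamming distance, and scoring a ranking with average precision (the per-query quantity that mAP averages). It is an illustration under stated assumptions, not the authors' implementation: the random arrays stand in for CLIP image and text embeddings, and the concatenate-then-project fusion with an untrained matrix `proj` is a hypothetical placeholder for the paper's learned fusion module, which this summary does not specify. The names `fuse_and_hash`, `hamming_distance`, and `average_precision` are invented for this example.

```python
import numpy as np


def fuse_and_hash(img_feat, txt_feat, proj):
    """Fuse two modalities and binarise to {-1, +1} codes.

    Assumption: fusion is plain concatenation of L2-normalised features
    followed by a linear projection; the paper's fusion module may differ.
    """
    img = img_feat / np.linalg.norm(img_feat, axis=1, keepdims=True)
    txt = txt_feat / np.linalg.norm(txt_feat, axis=1, keepdims=True)
    fused = np.concatenate([img, txt], axis=1)
    return np.where(fused @ proj >= 0, 1, -1).astype(np.int8)


def hamming_distance(query_codes, db_codes):
    """Hamming distance between +/-1 codes of length K: (K - <q, d>) / 2."""
    k = query_codes.shape[1]
    dots = query_codes.astype(np.int32) @ db_codes.astype(np.int32).T
    return (k - dots) // 2


def average_precision(ranked_relevance):
    """Average precision of one ranked list of 0/1 relevance labels."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())


rng = np.random.default_rng(0)
img_feat = rng.normal(size=(6, 512))  # stand-in for CLIP image embeddings
txt_feat = rng.normal(size=(6, 512))  # stand-in for CLIP text embeddings
proj = rng.normal(size=(1024, 32))    # hypothetical, untrained 32-bit projection

codes = fuse_and_hash(img_feat, txt_feat, proj)
dist = hamming_distance(codes[:1], codes)      # query = first database item
ranking = np.argsort(dist[0], kind="stable")   # nearest codes first
```

In a real system the placeholder arrays would be replaced by embeddings from a CLIP encoder and `proj` by trained hashing parameters; mAP is then the mean of `average_precision` over all queries, with relevance typically defined by shared labels.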