Abstract: Multi-modal hashing methods, which fuse multi-source data to generate binary hash codes, are widely used in multimedia retrieval. However, the individual backbone networks used by existing methods have limited feature expression capabilities and are not jointly pre-trained on large-scale unsupervised multi-modal data, resulting in low retrieval accuracy. To address this issue, we propose a novel CLIP Multi-modal Hashing (CLIPMH) method. Our method employs the CLIP framework to extract both text and vision features and then fuses them to generate hash codes. Because the features of each modality are enhanced, our method greatly improves the retrieval performance of multi-modal hashing. Experiments show that CLIPMH significantly outperforms state-of-the-art unsupervised and supervised multi-modal hashing methods (a maximum increase of 8.38% in mAP).
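The abstract reports its gain in mAP (mean average precision). As a reminder of what that metric measures, here is a minimal sketch in plain Python. It assumes each query yields a ranked list of 0/1 relevance flags (for example, database items sorted by Hamming distance to the query code); the function names are illustrative and not taken from the paper, and the paper's exact evaluation protocol (such as any cut-off on the ranked list) is not specified here.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of 0/1 flags, best-ranked item first.
    Returns 0.0 when no relevant item is retrieved.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precisions."""
    scores = [average_precision(r) for r in per_query_relevance]
    return sum(scores) / len(scores)


# Two toy queries: relevant items at ranks 1 and 3, then at rank 2.
toy_map = mean_average_precision([[1, 0, 1], [0, 1]])
```

For the first toy query the average precision is (1/1 + 2/3) / 2, and for the second it is 1/2, so `toy_map` is the mean of those two values.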

What problem does this paper attempt to address?

This paper attempts to address the poor performance of multimodal hashing methods in multimedia retrieval. Specifically, existing multimodal hashing methods rely on backbone networks with limited feature representation capabilities that are not jointly pre-trained on large-scale unsupervised multimodal data, which results in low retrieval accuracy. To solve this problem, the authors propose a new multimodal hashing method based on the CLIP framework (CLIPMH), which extracts textual and visual features with CLIP and fuses them to generate hash codes, significantly improving retrieval performance.

### Main Contributions:

1. **Proposing the CLIPMH Method**: For the first time, a large-scale multimodal model (such as CLIP) is applied to multimodal hashing retrieval, addressing the issue of insufficient feature representation in existing methods.
2. **Utilizing the CLIP Framework**: High-quality visual and textual features are extracted through the CLIP framework, and hash codes are generated through a multimodal fusion module, significantly enhancing retrieval performance.
3. **Experimental Validation**: Experimental results on multiple benchmark datasets (such as MIR-Flickr25K, NUS-WIDE, and MS COCO) show that CLIPMH achieves significant performance improvements over existing methods, with a maximum increase of 8.38% in mAP.

### Problems Addressed:

- **Insufficient Feature Representation**: The backbone networks of existing methods lack good feature representation capabilities, leading to low retrieval accuracy.
- **Lack of Joint Pre-training**: The backbone networks of existing methods are trained on individual modalities without joint pre-training, resulting in insufficient semantic alignment.
- **Semantic Gap**: Existing methods fail to effectively address the semantic gap between different modalities, affecting overall retrieval performance.

By introducing the CLIP framework, the CLIPMH method achieves significant progress in feature extraction and semantic alignment, thereby greatly enhancing the retrieval performance of multimodal hashing methods.
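
The extract, fuse, and hash pipeline summarized above can be illustrated with a small sketch. This is not the authors' implementation. It assumes the image and text feature vectors have already been produced by CLIP encoders (plain Python lists stand in for them here), uses simple concatenation as the fusion step, and substitutes a fixed random projection followed by a sign threshold for the learned fusion and hashing module. All function names are hypothetical.

```python
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm > 0 else list(vec)


def make_projection(in_dim, n_bits, seed=0):
    """Random Gaussian projection; an assumed stand-in for a learned hash layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(n_bits)]


def fuse_and_hash(image_feat, text_feat, projection):
    """Fuse two modality features by concatenation and binarize by sign.

    Returns the hash code packed into a Python int, one bit per projection row.
    """
    fused = l2_normalize(image_feat) + l2_normalize(text_feat)  # list concatenation
    code = 0
    for row in projection:
        activation = sum(w * x for w, x in zip(row, fused))
        code = (code << 1) | (1 if activation >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two packed hash codes."""
    return bin(a ^ b).count("1")


# Toy 4-dimensional "image" and "text" features (placeholders for CLIP outputs).
projection = make_projection(in_dim=8, n_bits=16)
query = fuse_and_hash([0.9, 0.1, 0.0, 0.2], [0.3, 0.8, 0.1, 0.0], projection)
database = {
    "near": fuse_and_hash([0.8, 0.2, 0.0, 0.2], [0.3, 0.7, 0.2, 0.0], projection),
    "far": fuse_and_hash([-0.9, 0.1, 0.5, -0.2], [-0.3, -0.8, 0.1, 0.4], projection),
}
# Retrieval: rank database items by Hamming distance to the query code.
ranking = sorted(database, key=lambda name: hamming(query, database[name]))
```

Packing codes into integers keeps the distance computation to an XOR and a bit count, which is the property that makes binary hash codes attractive for large-scale retrieval.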

CLIP Multi-modal Hashing for Multimedia Retrieval
