Abstract: As multi-modal data proliferates, users are no longer content with retrieving information through a single modality. Deep hashing retrieval algorithms have attracted much attention for their efficient storage and fast query speed. Existing unsupervised hashing methods generally have two limitations: (1) They fail to adequately capture the latent semantic relevance and co-occurrence information across modalities, and therefore lack effective feature and hash-code representations to bridge the heterogeneity and semantic gaps in multi-modal data. (2) They typically construct a similarity matrix to guide hash code learning, and inaccuracies in that matrix lead to sub-optimal retrieval performance. To address these issues, we propose a novel CLIP-based fusion-modal reconstructing hashing method for large-scale unsupervised cross-modal retrieval. First, we use CLIP to encode cross-modal features of visual modalities, and learn the common representation space of the hash code using modality-specific autoencoders. Second, we propose an efficient fusion approach to construct a semantically complementary affinity matrix that maximizes the potential semantic relevance of instances from different modalities. Furthermore, to retain the intrinsic semantic similarity of all similar pairs in the learned hash codes, we design a similarity reconstruction objective based on semantic complementation to learn high-quality hash code representations. Extensive experiments on four multi-modal benchmark datasets (WIKI, MIRFLICKR, NUS-WIDE, and MS COCO) show that the proposed method achieves state-of-the-art image-text retrieval performance compared to several representative unsupervised cross-modal hashing methods.
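
The two ideas named in the abstract, a fused affinity matrix and a similarity reconstruction objective, can be illustrated with a minimal NumPy sketch. This is not the paper's formulation: the convex-combination fusion, the weight `alpha`, the mean-squared reconstruction loss, the random projection, and all function names are assumptions made for illustration, and random arrays stand in for the CLIP features and autoencoder outputs.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of `features`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized = features / np.maximum(norms, 1e-12)  # guard against zero rows
    return normalized @ normalized.T


def fused_affinity(image_features, text_features, alpha=0.5):
    """Hypothetical fusion: convex combination of the two intra-modal
    cosine similarity matrices (an assumption, not the paper's rule)."""
    s_image = cosine_similarity_matrix(image_features)
    s_text = cosine_similarity_matrix(text_features)
    return alpha * s_image + (1.0 - alpha) * s_text


def similarity_reconstruction_loss(codes_a, codes_b, affinity):
    """Mean squared gap between the cosine similarity of two sets of
    (relaxed) hash codes and the target affinity matrix (assumed form)."""
    a = codes_a / np.maximum(np.linalg.norm(codes_a, axis=1, keepdims=True), 1e-12)
    b = codes_b / np.maximum(np.linalg.norm(codes_b, axis=1, keepdims=True), 1e-12)
    reconstructed = a @ b.T
    return float(np.mean((reconstructed - affinity) ** 2))


# Stand-in data: 6 paired instances with 16-dimensional features per modality.
rng = np.random.default_rng(0)
image_features = rng.normal(size=(6, 16))
text_features = rng.normal(size=(6, 16))

affinity = fused_affinity(image_features, text_features, alpha=0.6)

# A random linear map followed by tanh stands in for a learned hashing layer
# that produces relaxed 8-bit codes in (-1, 1).
projection = rng.normal(size=(16, 8))
relaxed_image_codes = np.tanh(image_features @ projection)
relaxed_text_codes = np.tanh(text_features @ projection)

loss = similarity_reconstruction_loss(relaxed_image_codes, relaxed_text_codes, affinity)

# Binary codes for retrieval are obtained by taking the sign of the relaxed codes.
binary_image_codes = np.sign(relaxed_image_codes)
print(affinity.shape, loss)
```

In a trained model the loss would be minimized with respect to the encoder parameters, so that the similarity structure of the hash codes approaches the fused affinity matrix. The sketch only evaluates the loss once on fixed random codes.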

Self-auxiliary Hashing for Unsupervised Cross Modal Retrieval

Efficient Discrete Supervised Hashing for Large-scale Cross-modal Retrieval

Semantic Consistency Hashing for Cross-Modal Retrieval

Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval

Unsupervised Multi-modal Hashing for Cross-Modal Retrieval

Self-supervised incomplete cross-modal hashing retrieval

Self-Attentive CLIP Hashing for Unsupervised Cross-Modal Retrieval

Deep Semantic-Alignment Hashing for Unsupervised Cross-Modal Retrieval

Semi-Supervised Semantic-Preserving Hashing For Efficient Cross-Modal Retrieval

Self-supervised Learning-Based Weight Adaptive Hashing for Fast Cross-Modal Retrieval

Attention-Guided Semantic Hashing for Unsupervised Cross-Modal Retrieval

Asymmetric Supervised Consistent and Specific Hashing for Cross-Modal Retrieval

High-order nonlocal Hashing for unsupervised cross-modal retrieval

CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval

Semi-supervised Semi-paired Cross-modal Hashing

Supervised Intra- and Inter-Modality Similarity Preserving Hashing for Cross-Modal Retrieval

Unsupervised Deep Imputed Hashing for Partial Cross-modal Retrieval

Unsupervised Joint-Semantics Autoencoder Hashing for Multimedia Retrieval

Unsupervised Cross-modal Hashing with Modality-interaction