Abstract: Implementing cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems. Simply applying existing cross-modal approaches to this new task fails to adequately capture latent multi-modal semantics and effectively bridge the modality gap between 2D and 3D. To address these issues without relying on hand-crafted labels, we propose contrastive masked autoencoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data. We start by contrasting 2D-3D pairs and explicitly constraining them into a joint Hamming space. This contrastive learning process ensures robust discriminability for the generated hash codes and effectively reduces the modality gap. Moreover, we utilize multi-modal auto-encoders to enhance the model's understanding of multi-modal semantics. By completing the masked image/point-cloud data modeling task, the model is encouraged to capture more localized clues. In addition, the proposed multi-modal fusion block facilitates fine-grained interactions among different modalities. Extensive experiments on three public datasets demonstrate that the proposed CMAH significantly outperforms all baseline methods.
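
The contrastive step described in the abstract can be illustrated with a generic symmetric InfoNCE loss over paired image and point-cloud embeddings, followed by sign binarization into hash codes. This is a minimal sketch, not the paper's actual objective: the function names `info_nce` and `binarize`, the `temperature` value, and the use of plain sign thresholding are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(img_emb, pc_emb, temperature=0.1):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched 2D-3D pair."""
    img = [l2_normalize(v) for v in img_emb]
    pc = [l2_normalize(v) for v in pc_emb]
    n = len(img)
    # sim[i][j]: temperature-scaled cosine similarity of image i and point cloud j.
    sim = [
        [sum(a * b for a, b in zip(img[i], pc[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable -log softmax(logits)[target].
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    # Image-to-point-cloud direction uses rows, the reverse direction uses columns.
    i2p = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    p2i = sum(cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (i2p + p2i)


def binarize(v):
    """Map a real-valued embedding to a +1/-1 hash code by sign (assumed scheme)."""
    return [1 if x >= 0 else -1 for x in v]
```

In this sketch the loss is lower when matched pairs point in similar directions than when the pairing is shuffled, which is the property that pulls the two modalities toward a shared code space.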

What problem does this paper attempt to address?

The paper primarily addresses cross-modal retrieval between 2D images and 3D point cloud data. Specifically, the research focuses on the following key points:

1. **Cross-modal Hashing**: The authors point out that in practical applications, such as autonomous driving and augmented reality, there is a growing need to quickly retrieve 3D point cloud data from 2D images, or vice versa. Although traditional cross-modal hashing methods have achieved good results in image-text and image-video scenarios, the significant modality gap between 2D images and 3D point cloud data makes directly applying existing methods to this new task ineffective.
2. **Modality Gap**: 3D point cloud data and 2D images differ significantly in how they are formed and described, so effectively bridging the gap between the two modalities is a challenge. On one hand, the irregular and unordered structure of point cloud data makes it difficult to capture meaningful semantic information; on the other hand, the feature variations and semantic differences between 2D pixels and 3D coordinates hinder accurate learning of correspondences between the two modalities.
3. **Self-supervised Learning**: To overcome these challenges, the paper proposes a self-supervised hashing method based on contrastive masked autoencoders (CMAH). The method combines multi-modal contrastive learning with masked image/point cloud modeling to capture local features while maintaining global relationships. In this way, CMAH can generate highly discriminative hash codes and reduce the gap between modalities, thereby improving cross-modal retrieval performance.

In summary, the paper proposes a new self-supervised cross-modal hashing method (CMAH) for retrieval between 2D images and 3D point cloud data, with the goal of overcoming the modality gap faced by existing methods and improving retrieval performance.
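
To make the retrieval side concrete, the sketch below shows how binary hash codes in a shared Hamming space can be used to rank point clouds for an image query. It is an illustration under stated assumptions rather than the paper's pipeline: the helpers `pack_bits`, `hamming`, and `retrieve` are hypothetical names, and the 4-bit codes are toy values.

```python
def pack_bits(code):
    """Pack a +1/-1 hash code into a single integer, one bit per element."""
    value = 0
    for bit in code:
        value = (value << 1) | (1 if bit > 0 else 0)
    return value


def hamming(a, b):
    """Hamming distance between two packed codes: count the differing bits."""
    return bin(a ^ b).count("1")


def retrieve(query, database, top_k=3):
    """Return indices of the top_k database codes closest to the query."""
    ranked = sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
    return ranked[:top_k]


# Toy example: one image code queried against three point-cloud codes.
point_cloud_codes = [
    pack_bits([1, 1, -1, -1]),
    pack_bits([1, -1, 1, -1]),
    pack_bits([-1, -1, 1, 1]),
]
image_query = pack_bits([1, 1, -1, 1])
nearest = retrieve(image_query, point_cloud_codes, top_k=1)
```

Packing codes into integers lets the distance reduce to an XOR and a bit count, which is the main reason hashing is attractive for large-scale retrieval.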

Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval

Discrete Cross-Modal Hashing for Efficient Multimedia Retrieval

Nonlinear Discrete Cross-Modal Hashing for Visual-Textual Data

Efficient Discrete Supervised Hashing for Large-scale Cross-modal Retrieval

Asymmetric Supervised Consistent and Specific Hashing for Cross-Modal Retrieval

High-order nonlocal Hashing for unsupervised cross-modal retrieval

Multi-Task Consistency-Preserving Adversarial Hashing for Cross-Modal Retrieval

Dense Auto-Encoder Hashing for Robust Cross-Modality Retrieval

Unsupervised Multi-modal Hashing for Cross-Modal Retrieval

Multi-Level Correlation Adversarial Hashing for Cross-Modal Retrieval

Unsupervised Contrastive Cross-Modal Hashing

Structure-aware contrastive hashing for unsupervised cross-modal retrieval

Deep consistency-preserving hash auto-encoders for neuroimage cross-modal retrieval

Adaptive Marginalized Semantic Hashing for Unpaired Cross-Modal Retrieval

Cross-modal retrieval based on multi-dimensional feature fusion hashing

Sequential Discrete Hashing for Scalable Cross-Modality Similarity Retrieval

Correlation Autoencoder Hashing for Supervised Cross-Modal Search

CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval

Multi-Relational Deep Hashing for Cross-Modal Search

Deep Manifold Hashing: A Divide-and-Conquer Approach for Semi-Paired Unsupervised Cross-Modal Retrieval