Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval

Rukai Wei, Heng Cui, Yu Liu, Yufeng Hou, Yanzhao Xie, Ke Zhou
2024-08-11
Abstract: Implementing cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems. Simply applying existing cross-modal approaches to this new task fails to adequately capture latent multi-modal semantics and effectively bridge the modality gap between 2D and 3D. To address these issues without relying on hand-crafted labels, we propose contrastive masked autoencoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data. We start by contrasting 2D-3D pairs and explicitly constraining them into a joint Hamming space. This contrastive learning process ensures robust discriminability for the generated hash codes and effectively reduces the modality gap. Moreover, we utilize multi-modal auto-encoders to enhance the model's understanding of multi-modal semantics. By completing the masked image/point-cloud data modeling task, the model is encouraged to capture more localized clues. In addition, the proposed multi-modal fusion block facilitates fine-grained interactions among different modalities. Extensive experiments on three public datasets demonstrate that the proposed CMAH significantly outperforms all baseline methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses cross-modal retrieval between 2D images and 3D point cloud data. It focuses on three key points:
1. **Cross-modal Hashing**: The authors point out that practical applications such as autonomous driving and augmented reality increasingly need to retrieve 3D point cloud data quickly from 2D images, and vice versa. Traditional cross-modal hashing methods perform well in image-text and image-video scenarios, but the significant modality gap between 2D images and 3D point cloud data makes directly applying them to this new task ineffective.
2. **Modality Gap**: 3D point cloud data and 2D images differ substantially in how they are formed and described, so bridging the two modalities is a challenge. On one hand, the irregular and unordered structure of point cloud data makes it difficult to capture meaningful semantic information. On the other hand, the differences in features and semantics between 2D pixels and 3D coordinates hinder accurate learning of correspondences between the two modalities.
3. **Self-supervised Learning**: To overcome these challenges, the paper proposes a self-supervised hashing method based on contrastive masked autoencoders (CMAH). The method combines multi-modal contrastive learning with masked image/point cloud modeling to capture local features while maintaining global relationships. In this way, CMAH generates highly discriminative hash codes and reduces the gap between modalities, thereby improving cross-modal retrieval performance.
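To make the contrastive step concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE-style loss over relaxed hash codes. It is an illustration under stated assumptions, not the authors' implementation: the function name `symmetric_info_nce`, the `tanh` relaxation, the temperature value, and the toy data are all hypothetical, and the summary above does not specify the exact loss CMAH uses.

```python
import numpy as np

def symmetric_info_nce(img_codes, pc_codes, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of img_codes and row i of pc_codes are treated as a matched
    2D-3D pair; every other row in the batch serves as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    img = img_codes / np.linalg.norm(img_codes, axis=1, keepdims=True)
    pc = pc_codes / np.linalg.norm(pc_codes, axis=1, keepdims=True)
    logits = img @ pc.T / temperature  # shape (N, N)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The matched pair sits on the diagonal.
        return -np.mean(np.diag(log_prob))

    # Average the image-to-point-cloud and point-cloud-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy data: 8 pairs sharing a latent signal, with small modality-specific noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
# tanh keeps codes in (-1, 1), a common continuous stand-in for +/-1 bits.
img_codes = np.tanh(shared + 0.1 * rng.normal(size=shared.shape))
pc_codes = np.tanh(shared + 0.1 * rng.normal(size=shared.shape))

loss_matched = symmetric_info_nce(img_codes, pc_codes)
# Shifting the point-cloud rows by one breaks every pairing.
loss_mismatched = symmetric_info_nce(img_codes, np.roll(pc_codes, 1, axis=0))
print(loss_matched, loss_mismatched)
```

In this toy setup the loss is lower when each image code is paired with its own point-cloud code than when the pairing is shifted. That difference is the signal that pulls matched 2D-3D codes together during training.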
In summary, the paper proposes CMAH, a self-supervised cross-modal hashing method for retrieval between 2D images and 3D point cloud data, designed to bridge the modality gap that limits existing methods and thereby improve retrieval performance.
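For readers unfamiliar with hashing-based retrieval, the sketch below shows how binary codes are typically ranked by Hamming distance once they have been learned. It is a generic illustration, not code from the paper: the helper names `binarize`, `hamming_distances`, and `retrieve`, the 8-bit code length, and the example codes are all hypothetical.

```python
import numpy as np

def binarize(relaxed):
    """Map real-valued codes to {-1, +1} bits (zeros map to +1)."""
    return np.where(relaxed >= 0, 1, -1).astype(np.int8)

def hamming_distances(query, database):
    """Hamming distance from one +/-1 query code to each database code.

    For +/-1 codes of length K, d_H(q, b) = (K - <q, b>) / 2.
    """
    k = query.shape[0]
    dots = database.astype(np.int32) @ query.astype(np.int32)
    return (k - dots) // 2

def retrieve(query, database, top_k=3):
    """Return indices and distances of the top_k nearest database codes."""
    distances = hamming_distances(query, database)
    order = np.argsort(distances, kind="stable")[:top_k]
    return order, distances[order]

# Hypothetical 8-bit code for an image query.
image_query = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=np.int8)
# Hypothetical 8-bit codes for four point clouds in the database.
point_cloud_db = np.array([
    [1, 1, 1, 1, -1, -1, -1, -1],    # identical to the query
    [1, 1, 1, 1, -1, -1, -1, 1],     # differs in 1 bit
    [-1, -1, -1, -1, 1, 1, 1, 1],    # differs in all 8 bits
    [1, 1, -1, -1, -1, -1, -1, -1],  # differs in 2 bits
], dtype=np.int8)

indices, distances = retrieve(image_query, point_cloud_db, top_k=3)
print(indices.tolist(), distances.tolist())  # [0, 1, 3] [0, 1, 2]
```

Because the comparison reduces to integer dot products (or XOR and popcount on packed bits), ranking stays cheap even for large databases. This is the usual motivation for hashing over real-valued embeddings.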