Multimodal contrastive learning using point clouds and their rendered images

Wonyong Lee,Hyungki Kim
DOI: https://doi.org/10.1007/s11042-024-18653-7
IF: 2.577
2024-02-27
Multimedia Tools and Applications
Abstract:In this paper, we propose a novel unsupervised pre-training method for point cloud deep learning models using multimodal contrastive learning. Point clouds, which consist of a set of three-dimensional coordinate points acquired from 3D scanners, lidars, depth cameras, etc. play an important role in representing 3D scenes, and understanding them is crucial for implementing autonomous driving or navigation. Deep learning models based on supervised learning for point cloud understanding require a label for each point cloud data that corresponds to the correct answer in training. However, generating these labels is expensive, making it difficult to build large datasets, which is essential for good model performance. Our proposed unsupervised pre-training method, on the other hand, does not require labels and can serve as an initial value for a model that can alleviate the need for such large datasets. The proposed method is characterized as a multimodal approach that utilizes two modalities for point clouds: the point cloud itself and an image rendering of the point cloud. By using images that directly render the point clouds, the shape information of the point clouds from various viewpoints can be obtained from the images without additional data such as meshes. We pre-trained the model with the proposed method and conducted performance comparison on ModelNet40 and ScanObjectNN datasets. The linear classification accuracy of the point cloud feature vector extracted by the pre-trained model was 91.5% and 83.9%, and after fine-tuning for each dataset, the classification accuracy was 93.3% and 86.9%, respectively.
computer science, information systems, theory & methods,engineering, electrical & electronic, software engineering
What problem does this paper attempt to address?