Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training

Runnan Chen, Xinge Zhu, Nenglun Chen, Dawei Wang, Wei Li, Yuexin Ma, Ruigang Yang, Tongliang Liu, Wenping Wang
DOI: https://doi.org/10.48550/arXiv.2309.16956
2023-09-29
Abstract: Current successful methods of 3D scene perception rely on the large-scale annotated point cloud, which is tedious and expensive to acquire. In this paper, we propose Model2Scene, a novel paradigm that learns free 3D scene representation from Computer-Aided Design (CAD) models and languages. The main challenges are the domain gaps between the CAD models and the real scene's objects, including model-to-scene (from a single model to the scene) and synthetic-to-real (from synthetic model to real scene's object). To handle the above challenges, Model2Scene first simulates a crowded scene by mixing data-augmented CAD models. Next, we propose a novel feature regularization operation, termed Deep Convex-hull Regularization (DCR), to project point features into a unified convex hull space, reducing the domain gap. Ultimately, we impose contrastive loss on language embedding and the point features of CAD models to pre-train the 3D network. Extensive experiments verify the learned 3D scene representation is beneficial for various downstream tasks, including label-free 3D object salient detection, label-efficient 3D scene perception and zero-shot 3D semantic segmentation. Notably, Model2Scene yields impressive label-free 3D object salient detection with an average mAP of 46.08% and 55.49% on the ScanNet and S3DIS datasets, respectively. The code will be publicly available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to reduce the need for large-scale annotated point cloud data in 3D scene perception and to improve generalization across datasets. Successful 3D scene perception methods currently rely on large-scale annotated point clouds, which are tedious and expensive to acquire. In addition, most methods perform well in specific scenarios but poorly in scenarios with large domain differences. The paper therefore proposes a new paradigm, Model2Scene, which learns 3D scene representations from Computer-Aided Design (CAD) models and language.

### Main Challenges

1. **Model-to-Scene Gap**: CAD models are usually isolated and complete, while objects in real-world scenes vary in pose, size, and position, and may be occluded by other objects.
2. **Synthetic-to-Real Gap**: The surfaces of CAD models are usually clean and smooth, while the surfaces of real scanned objects are irregular and noisy because of the scanning devices.

### Solutions

1. **Crowded Scene Simulation**: Simulate a crowded scene by mixing data-augmented CAD models, reducing the model-to-scene gap.
2. **Deep Convex-hull Regularization (DCR)**: A new feature regularization operation that projects point features into a unified convex-hull space, further reducing the domain gap.
3. **Language-CAD Contrastive Learning**: Impose a contrastive loss between language embeddings and the point features of CAD models to pre-train the 3D network, so that it learns better 3D scene representations.

### Experimental Results

- **Label-Free 3D Object Salient Detection**: Model2Scene achieves an average mAP of 46.08% on ScanNet and 55.49% on S3DIS.
- **Label-Efficient 3D Scene Perception**: With only a small amount of annotated data, Model2Scene performs better than other methods.
- **Zero-Shot 3D Semantic Segmentation**: Model2Scene shows preliminary zero-shot capability and can perceive unseen objects.

### Contributions

- Propose a new paradigm, Model2Scene, to learn 3D scene representations from CAD models and language.
- Propose Deep Convex-hull Regularization to handle the domain gap between CAD models and real-scene objects.
- Achieve satisfactory results in label-free 3D object salient detection, label-efficient 3D perception, and zero-shot 3D semantic segmentation.

Together, these components reduce the dependence on large-scale annotated data and improve the model's generalization across datasets.
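
The three solution steps summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the summary does not specify these details, so several choices are assumptions made for illustration: the augmentations (random scale, rotation about the vertical axis, random floor position), the form of the convex-hull projection (a softmax-weighted combination of a hypothetical set of anchor vectors), and the contrastive loss (a cross-entropy over cosine similarities between point features and per-class text embeddings). Random arrays stand in for CAD models, network features, learned anchors, and language embeddings.

```python
import numpy as np

def random_rotation_z(rng):
    """Rotation matrix for a random angle about the vertical (z) axis."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def simulate_crowded_scene(models, labels, rng, room_size=4.0):
    """Mix augmented CAD point clouds into one scene with per-point labels.

    models: list of (N_i, 3) arrays; labels: list of integer class ids.
    The augmentation ranges are illustrative assumptions.
    """
    points, point_labels = [], []
    for pts, label in zip(models, labels):
        pts = pts - pts.mean(axis=0)              # centre the model
        pts = pts * rng.uniform(0.5, 1.5)         # random scale
        pts = pts @ random_rotation_z(rng).T      # random pose
        offset = np.append(rng.uniform(0.0, room_size, size=2), 0.0)
        points.append(pts + offset)               # random floor position
        point_labels.append(np.full(len(pts), label))
    return np.concatenate(points), np.concatenate(point_labels)

def convex_hull_projection(features, anchors):
    """Re-express each feature as a convex combination of anchor vectors.

    Softmax weights are non-negative and sum to 1, so every output lies
    inside the convex hull of `anchors` (an assumed form of DCR).
    """
    logits = features @ anchors.T                 # (N, K) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ anchors, weights

def contrastive_loss(point_feats, text_embeds, point_labels, temperature=0.07):
    """Cross-entropy over cosine similarities between points and class texts."""
    eps = 1e-8
    p = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + eps)
    logits = (p @ t.T) / temperature              # (N, C)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(point_labels)), point_labels].mean()

rng = np.random.default_rng(0)
models = [rng.normal(size=(100, 3)), rng.normal(size=(80, 3))]  # stand-in CAD models
scene, scene_labels = simulate_crowded_scene(models, [0, 1], rng)
feats = rng.normal(size=(len(scene), 16))   # stand-in for 3D network output
anchors = rng.normal(size=(8, 16))          # stand-in for learned anchors
text = rng.normal(size=(2, 16))             # stand-in for language embeddings
projected, weights = convex_hull_projection(feats, anchors)
loss = contrastive_loss(projected, text, scene_labels)
```

In an actual pre-training setup the features would come from a 3D backbone, the anchors would be learned jointly with it, and the text embeddings would come from a language model, with the loss driving gradient updates. The sketch only shows how the pieces connect.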