Abstract: Current successful methods of 3D scene perception rely on large-scale annotated point clouds, which are tedious and expensive to acquire. In this paper, we propose Model2Scene, a novel paradigm that learns free 3D scene representation from Computer-Aided Design (CAD) models and languages. The main challenges are the domain gaps between the CAD models and the real scene's objects, including model-to-scene (from a single model to the scene) and synthetic-to-real (from synthetic model to real scene's object). To handle these challenges, Model2Scene first simulates a crowded scene by mixing data-augmented CAD models. Next, we propose a novel feature regularization operation, termed Deep Convex-hull Regularization (DCR), to project point features into a unified convex hull space, reducing the domain gap. Finally, we impose a contrastive loss on the language embeddings and the point features of CAD models to pre-train the 3D network. Extensive experiments verify that the learned 3D scene representation is beneficial for various downstream tasks, including label-free 3D object salient detection, label-efficient 3D scene perception, and zero-shot 3D semantic segmentation. Notably, Model2Scene yields impressive label-free 3D object salient detection with an average mAP of 46.08% and 55.49% on the ScanNet and S3DIS datasets, respectively. The code will be publicly available.

What problem does this paper attempt to address?

The paper aims to reduce the dependence of 3D scene perception on large-scale annotated point cloud data and to improve generalization across datasets. Currently successful 3D scene perception methods rely on such annotations, which are cumbersome and expensive to acquire. In addition, most methods perform well in the scenarios they target but poorly in scenarios with a large domain difference. The paper therefore proposes a new paradigm, Model2Scene, which learns label-free 3D scene representations from Computer-Aided Design (CAD) models and language.

### Main Challenges

1. **Model-to-Scene Gap**: CAD models are usually isolated and complete, while objects in real-world scenes vary in pose, size, and position and may be occluded by other objects.
2. **Synthetic-to-Real Gap**: The surfaces of CAD models are usually clean and smooth, while the surfaces of scanned real objects are irregular and noisy because of the scanning device.

### Solutions

1. **Crowded Scene Simulation**: Simulate a crowded scene by mixing data-augmented CAD models, which reduces the model-to-scene gap.
2. **Deep Convex-hull Regularization (DCR)**: Apply a new feature regularization operation that projects point features into a unified convex-hull space, further reducing the domain gap.
3. **Language-CAD Contrastive Learning**: Impose a contrastive loss between the language embeddings and the point features of CAD models to pre-train the 3D network.

### Experimental Results

- **Label-Free 3D Object Salient Detection**: Model2Scene achieves an average mAP of 46.08% on ScanNet and 55.49% on S3DIS.
- **Label-Efficient 3D Scene Perception**: With only a small amount of annotated data, Model2Scene outperforms the compared methods.
- **Zero-Shot 3D Semantic Segmentation**: Model2Scene shows preliminary zero-shot capability and can perceive unseen objects.

### Contributions

- Propose a new paradigm, Model2Scene, that learns 3D scene representations from CAD models and language.
- Propose Deep Convex-hull Regularization to handle the domain gap between CAD models and objects in real scenes.
- Achieve satisfactory results on label-free 3D object salient detection, label-efficient 3D scene perception, and zero-shot 3D semantic segmentation.

Through these methods, Model2Scene both reduces the dependence on large-scale annotated data and improves generalization across datasets.
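The three solution steps summarized above can be illustrated with a small NumPy sketch. It is not the authors' implementation: every function name, the augmentation ranges, the anchor count, and the temperature are assumptions made for illustration. In particular, the softmax-weighted combination of anchor vectors is only one plausible reading of "projecting point features into a unified convex-hull space", and random arrays stand in for the CAD point clouds, the 3D backbone features, and the language embeddings.

```python
import numpy as np

def augment_model(points, rng):
    """Randomly rotate (about z), scale, and shift one CAD point cloud (N, 3)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.8, 1.2)            # assumed scale range
    shift = rng.uniform(-2.0, 2.0, size=3)   # assumed scene extent
    shift[2] = 0.0                           # only move objects horizontally
    return (scale * points) @ rot_z.T + shift

def simulate_crowded_scene(models, labels, rng):
    """Mix augmented CAD models into one scene with per-point class labels."""
    pts = [augment_model(m, rng) for m in models]
    lbl = [np.full(len(m), l) for m, l in zip(models, labels)]
    return np.concatenate(pts), np.concatenate(lbl)

def convex_hull_projection(feats, anchors):
    """Rewrite each feature as a convex combination of anchor vectors.

    Softmax weights are non-negative and sum to 1, so every output row
    lies inside the convex hull of the anchors.
    """
    logits = feats @ anchors.T                      # (M, K) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ anchors, weights

def language_contrastive_loss(feats, text_emb, labels, temperature=0.07):
    """Cross-entropy over cosine similarities to per-class text embeddings."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = f @ t.T / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
models = [rng.normal(size=(100, 3)), rng.normal(size=(50, 3))]  # stand-in CAD models
scene, labels = simulate_crowded_scene(models, [0, 1], rng)
feats = rng.normal(size=(len(scene), 16))   # stand-in for 3D backbone output
anchors = rng.normal(size=(8, 16))          # would be learnable in practice
text_emb = rng.normal(size=(2, 16))         # stand-in for class-name embeddings
proj, weights = convex_hull_projection(feats, anchors)
loss = language_contrastive_loss(proj, text_emb, labels)
```

In a real training loop the anchors and the backbone would be optimized by gradient descent in an autodiff framework; the NumPy version only shows the data flow from simulated scene to regularized features to the language-supervised loss.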

Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training
