Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Xiaoyu Zhu, Hao Zhou, Pengfei Xing, Long Zhao, Hao Xu, Junwei Liang, Alexander Hauptmann, Ting Liu, Andrew Gallagher
2024-07-19
Abstract: In this paper, we investigate the use of diffusion models which are pre-trained on large-scale image-caption pairs for open-vocabulary 3D semantic understanding. We propose a novel method, namely Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks. Diff2Scene gets rid of any labeled 3D data and effectively identifies objects, appearances, materials, locations and their compositions in 3D scenes. We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods. In particular, Diff2Scene improves the state-of-the-art method on ScanNet200 by 12%.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses 3D semantic segmentation and visual grounding in an open-vocabulary setting. Existing 3D semantic understanding models are usually trained in a closed-set setting, so they can only predict classes that appear in the training set. Practical applications, however, need to handle unseen classes and compositional queries (for example, "find the white sneakers closer to the office chair"). To this end, the paper proposes Diff2Scene, a method that performs open-vocabulary 3D semantic segmentation using text-to-image diffusion models.

### Core of the problem

1. **3D semantic segmentation in an open-vocabulary setting**: How to handle unseen classes and compositional queries without annotated 3D data.
2. **Lack of annotated data**: 3D point clouds with dense labels are very scarce, which limits the training and generalization ability of models.

### Solutions

To address these challenges, the paper proposes the following:

- **Utilizing pre-trained diffusion models**: Diff2Scene uses pre-trained text-to-image diffusion models (such as Stable Diffusion). These models have been trained on large-scale text-image pairs and therefore provide rich semantic representations.
- **Mask distillation method**: A new mask distillation method transfers the knowledge of the 2D branch (a 2D semantic segmentation model based on the diffusion model) to the 3D branch (a geometry-aware 3D mask model). This method does not require any annotated 3D data.
- **Multimodal fusion**: The saliency patterns in RGB images are combined with the geometric information in point clouds to generate high-quality 3D semantic segmentation results.
### Formula explanation

The formulas in the paper mainly describe specific operations of the model, for example:

- The inner product between the 3D features and the 2D mask embeddings generates the logits:

\[ S_i = \langle F_{3d}, f_{2d}^{i} \rangle \]

- Cosine similarity is used as the multimodal mask distillation loss function:

\[ L = \sum_{i=1}^{N} \left( 1 - \cos\left(B_{3d}^{i}, B_{3d}'^{i}\right) \right) \]

With these methods, Diff2Scene can effectively perform 3D semantic segmentation in an open-vocabulary setting and achieves results significantly better than existing methods on multiple benchmark datasets.
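
To make the two formulas above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names (`mask_logits`, `distillation_loss`), the list-of-lists data layout, and the toy numbers are all assumptions for illustration. Per-point 3D features stand in for $F_{3d}$, one embedding per mask stands in for $f_{2d}^i$, and two lists of $N$ mask score vectors stand in for the pair $B_{3d}^i$ and $B_{3d}'^i$.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def mask_logits(point_features: Sequence[Vector],
                mask_embeddings: Sequence[Vector]) -> List[List[float]]:
    """S_i = <F_3d, f_2d^i>: row i holds the logit of every point for mask i.

    Assumed layout (hypothetical): point_features is P x D and
    mask_embeddings is N x D, so the result is N x P.
    """
    return [[dot(p, f) for p in point_features] for f in mask_embeddings]


def cosine(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity, with eps guarding against zero-length vectors."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / max(norm, eps)


def distillation_loss(masks_a: Sequence[Vector],
                      masks_b: Sequence[Vector]) -> float:
    """L = sum_i (1 - cos(a_i, b_i)) over N paired mask score vectors."""
    if len(masks_a) != len(masks_b):
        raise ValueError("both inputs must contain the same number of masks")
    return sum(1.0 - cosine(a, b) for a, b in zip(masks_a, masks_b))


# Toy example: 3 points with 2-dimensional features, 2 mask embeddings.
points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embeddings = [[2.0, 0.0], [0.0, 3.0]]
logits = mask_logits(points, embeddings)
# logits == [[2.0, 0.0, 2.0], [0.0, 3.0, 3.0]]

# First pair is identical (term close to 0), second pair is orthogonal (term 1).
loss = distillation_loss([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
                         [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
# loss is approximately 1.0
```

Each term of the loss lies between 0 (the two masks point in the same direction) and 2 (opposite directions), so the loss depends only on how the mask scores are distributed over points and not on their overall scale. A real pipeline would run the same arithmetic on batched tensors in a deep learning framework rather than on Python lists.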