Abstract: In this paper, we investigate the use of diffusion models pre-trained on large-scale image-caption pairs for open-vocabulary 3D semantic understanding. We propose a novel method, Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks. Diff2Scene requires no labeled 3D data and effectively identifies objects, appearances, materials, locations and their compositions in 3D scenes. We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods. In particular, Diff2Scene improves the state-of-the-art method on ScanNet200 by 12%.

What problem does this paper attempt to address?

The paper addresses 3D semantic segmentation and visual grounding in an open-vocabulary setting. Existing 3D semantic understanding models are usually trained in a closed-set setting, so they can only predict classes that appear in the training set. Practical applications, however, need to handle unseen classes and compositional queries (for example, "find the white sneakers closer to the office chair"). To this end, the paper proposes Diff2Scene, a method that performs open-vocabulary 3D semantic segmentation using text-to-image diffusion models.

### Core of the problem

1. **3D semantic segmentation in an open-vocabulary setting**: How to handle unseen classes and compositional queries without annotated 3D data.
2. **Lack of annotated data**: 3D point clouds with dense labels are very scarce, which limits both training and the generalization ability of the model.

### Solutions

To address these challenges, the paper proposes the following:

- **Utilizing pre-trained diffusion models**: Diff2Scene uses pre-trained text-to-image diffusion models (such as Stable Diffusion). These models have been trained on large-scale text-image pairs, so their frozen features carry rich semantic representations.
- **Mask distillation method**: A new mask distillation method transfers the knowledge of the 2D branch (a 2D semantic segmentation model based on the diffusion model) to the 3D branch (a geometry-aware 3D mask model). This method does not require any annotated 3D data.
- **Multimodal fusion**: The saliency patterns in RGB images are combined with the geometric information in point clouds to generate high-quality 3D semantic segmentation results.
### Formula explanation

The formulas in the paper mainly describe specific operations of the model, for example:

- The inner product between the 3D features and the 2D mask embeddings generates the logits:

\[ S_i = \langle F_{3d}, f_{2d}^i \rangle \]

- Cosine similarity is used as the multimodal mask distillation loss function:

\[ L = \sum_{i=1}^{N} \left( 1 - \cos\left(B_{3d}^{i}, B_{3d}'^{i}\right) \right) \]

Through these methods, Diff2Scene can effectively perform 3D semantic segmentation in an open-vocabulary setting and has achieved results significantly better than existing methods on multiple benchmark datasets.
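The two formulas above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the array shapes (one row per 3D point or per mask) and the random inputs are assumptions made for illustration only. The summary does not specify how \(B_{3d}^{i}\) and \(B_{3d}'^{i}\) are produced, so they are treated here simply as two paired sets of N vectors.

```python
import numpy as np


def mask_logits(point_features: np.ndarray, mask_embeddings: np.ndarray) -> np.ndarray:
    """Compute S_i = <F_3d, f_2d^i> for every point and every mask.

    point_features:  (P, D) array, one D-dim feature per 3D point (assumed layout).
    mask_embeddings: (N, D) array, one D-dim embedding per 2D mask (assumed layout).
    Returns a (P, N) array whose column i holds the logits for mask i.
    """
    return point_features @ mask_embeddings.T


def cosine_distillation_loss(b: np.ndarray, b_prime: np.ndarray, eps: float = 1e-8) -> float:
    """Compute L = sum_i (1 - cos(b_i, b'_i)) over N paired rows.

    b, b_prime: (N, D) arrays of paired vectors (assumed layout).
    eps guards against division by zero for all-zero rows.
    """
    dot = np.sum(b * b_prime, axis=1)
    norms = np.linalg.norm(b, axis=1) * np.linalg.norm(b_prime, axis=1)
    cosine = dot / np.maximum(norms, eps)
    return float(np.sum(1.0 - cosine))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_3d = rng.normal(size=(1000, 64))   # hypothetical: 1000 points, 64-dim features
    f_2d = rng.normal(size=(20, 64))     # hypothetical: 20 mask embeddings
    logits = mask_logits(f_3d, f_2d)
    print(logits.shape)                  # (1000, 20)

    b = rng.normal(size=(20, 64))
    b_prime = rng.normal(size=(20, 64))
    print(cosine_distillation_loss(b, b_prime))
```

Each term of the loss lies between 0 (vectors point the same way) and 2 (vectors point in opposite directions), so the loss depends only on the direction of the paired vectors and not on their magnitude.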

Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Open-vocabulary Object Segmentation with Diffusion Models

Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

Diffusion Models for Open-Vocabulary Segmentation

FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

MaskDiffusion: Exploiting Pre-Trained Diffusion Models for Semantic Segmentation

Unleashing Text-to-Image Diffusion Models for Visual Perception

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation

Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models

SceneDiff: Generative Scene-Level Image Retrieval with Text and Sketch Using Diffusion Models

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation

Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

Diffusion Features to Bridge Domain Gap for Semantic Segmentation

DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models