Abstract: Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel-level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy, as well as segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuse features at object-level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.

What problem does this paper attempt to address?

The paper addresses how to extract high-quality 3D features from multi-view 2D images and apply them to language-guided 3D scene understanding tasks. Specifically, it focuses on the following aspects:

1. **Limitations of existing methods**: Current approaches to lifting 2D image features to 3D have several problems:
   - **Scene-specific neural fields**: These methods usually require online training and are optimized for a specific scene, so they do not generalize.
   - **Multi-view dependence**: Many methods rely on data from multiple camera views, which is impractical in real robot manipulation scenarios.
   - **Pixel-level fusion**: Existing methods usually fuse 2D features at the pixel level and assume that all views are equally informative. This yields lower-quality 3D features, with notably weaker segmentation accuracy and less crisp boundaries.
2. **Proposed method**: The paper proposes a multi-view feature fusion strategy that improves 3D feature quality by introducing object-centric priors. Specific improvements include:
   - **Object-level 2D feature extraction**: Isolate object instances in each view with instance segmentation masks and extract object-level 2D features.
   - **Semantics-based view selection**: Use dense object-level semantic information to define an informativeness metric that weights the contribution of each view and eliminates uninformative views.
   - **Feature fusion on 3D instance masks**: Fuse features only on the corresponding 3D object regions, which keeps the features high-quality and spatially consistent.
3. **Dataset construction**: To train and validate the method, the paper constructs a large-scale synthetic multi-view dataset (MV-TOD) containing about 15,000 scenes. Each scene provides RGB-D images from 73 views, along with rich annotations such as 2D/3D segmentation masks and 6-DoF grasp poses.
4. **Experimental validation**:
   - Through extensive ablation studies and comparative experiments, the paper shows that the method performs better on 3D semantic segmentation and referring segmentation tasks, and significantly outperforms existing methods in the single-view setting.
   - The paper also evaluates the method's zero-shot generalization and its use in robot manipulation tasks.

In summary, the main goal of the paper is to improve the quality of 3D features, and their usefulness for language-guided 3D scene understanding, by introducing object-centric priors into the process of lifting multi-view 2D features to 3D.
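The object-level fusion idea summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `fuse_object_features` and `lift_to_points`, the `min_score` threshold, and the assumption that a per-view informativeness score is already available as an input are all hypothetical, since the summary does not specify how the metric is computed. The sketch only shows the general pattern of dropping uninformative views, averaging one feature per object across the remaining views, and copying that feature to every 3D point in the object's instance mask.

```python
import numpy as np


def fuse_object_features(view_feats, view_scores, visible, min_score=0.5):
    """Fuse per-view object features into one feature per object.

    view_feats:  (V, K, C) one feature per view and object, e.g. an embedding
                 of the masked object crop (assumed precomputed).
    view_scores: (V, K) informativeness of each view for each object
                 (hypothetical metric, supplied by the caller).
    visible:     (V, K) bool, True if object k is visible in view v.
    min_score:   hypothetical threshold below which a view is discarded.
    Returns a (K, C) array of unit-normalized fused features; an object with
    no informative view gets an all-zero feature.
    """
    # Zero out views that are occluded or below the informativeness threshold.
    weights = np.where(visible & (view_scores >= min_score), view_scores, 0.0)
    # Weighted sum over views, separately for each object.
    fused = np.einsum("vk,vkc->kc", weights, view_feats)
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    return fused / np.maximum(norms, 1e-8)


def lift_to_points(object_feats, point_instance_ids):
    """Assign each 3D point the fused feature of its object instance.

    point_instance_ids: (N,) int array with the object index of each point,
                        or -1 for background points (left as zeros).
    """
    num_points = point_instance_ids.shape[0]
    point_feats = np.zeros((num_points, object_feats.shape[1]),
                           dtype=object_feats.dtype)
    foreground = point_instance_ids >= 0
    point_feats[foreground] = object_feats[point_instance_ids[foreground]]
    return point_feats
```

Because every point of an object receives the same feature, the resulting 3D feature map is piecewise constant over instance masks, which is one way to read the summary's claim about spatial consistency. Pixel-level fusion, by contrast, would average features per point across all views regardless of how informative each view is.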

3D Feature Distillation with Object-Centric Priors

3D-to-2D Distillation for Indoor Scene Parsing

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition

Towards CLIP-driven Language-free 3D Visual Grounding Via 2D-3D Relational Enhancement and Consistency

Distilling Focal Knowledge from Imperfect Expert for 3D Object Detection

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations

CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP

Monocular 3D Object Detection with Motion Feature Distillation

Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph

TPV-IGKD: Image-Guided Knowledge Distillation for 3D Semantic Segmentation with Tri-Plane-View

Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views

Learning 3D Scene Priors with 2D Supervision

X$^3$KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection

Attention-Based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection

DistillGrasp: Integrating Features Correlation with Knowledge Distillation for Depth Completion of Transparent Objects

Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding