3D Feature Distillation with Object-Centric Priors

Georgios Tziafas, Yucheng Xu, Zhibin Li, Hamidreza Kasaei
2024-10-06
Abstract: Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel-level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy, as well as segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuse features at object-level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
This paper addresses how to distill high-quality 3D features from multi-view 2D images and use them for language-guided 3D scene understanding. Specifically, it covers the following aspects:

1. **Limitations of existing methods**: Current approaches to lifting 2D image features to 3D have several problems:
   - **Scene-specific neural fields**: These methods usually require online training and are optimized for a specific scene, so they do not generalize.
   - **Multi-view dependence**: Many methods rely on data from multiple camera views, which is impractical in real robot manipulation scenarios.
   - **Pixel-level fusion**: Existing methods usually fuse 2D features at the pixel level and assume that all views are equally informative. This yields lower-quality 3D features, with reduced grounding accuracy and less crisp segmentation boundaries.
2. **Proposed new method**: The paper proposes a multi-view feature fusion strategy that improves 3D feature quality by introducing object-centric priors. Specific improvements include:
   - **Object-level 2D feature extraction**: Isolate object instances in each view through instance segmentation masks and extract object-level 2D features.
   - **Semantic-information-based view selection**: Use dense object-level semantic information to define an informativeness metric that weights the contribution of each view and eliminates uninformative views.
   - **Feature fusion on 3D instance masks**: Fuse features only on the corresponding 3D object regions, so that the resulting features are spatially consistent.
3. **Construction of the dataset**: To train and validate the new method, the paper constructs a large-scale synthetic multi-view dataset (MV-TOD), which contains about 15,000 scenes. Each scene provides RGB-D images from 73 views, as well as rich annotations such as 2D/3D segmentation masks and 6-DoF grasp poses.
4. **Experimental verification**:
   - Through extensive ablation studies and comparative experiments, the paper shows that the new method performs better on 3D semantic segmentation and referring segmentation tasks, and significantly outperforms existing methods in the single-view setting.
   - The paper also verifies the method's zero-shot generalization ability and its usefulness in robot manipulation tasks.

In summary, the main goal of this paper is to improve the quality of 3D features, and their usefulness for language-guided 3D scene understanding, by introducing object-centric priors into the process of lifting multi-view 2D features to 3D.
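
The object-level fusion idea summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`fuse_object_features`, `paint_points`, `ground_query`), the `min_score` threshold, and the per-view score itself are hypothetical stand-ins: the sketch assumes that one 2D feature per object and view has already been extracted (for example from a masked image region), and that some informativeness score per object and view is available, without reproducing the paper's actual metric.

```python
import numpy as np

def fuse_object_features(view_feats, view_scores, min_score=0.2):
    """Fuse per-view object-level features into one feature per object.

    view_feats:  (V, K, C) array, one C-dim 2D feature per view and object.
    view_scores: (V, K) array, assumed informativeness of each view for each
                 object (placeholder for the paper's semantic metric).
    Returns a (K, C) array of L2-normalized object features.
    """
    view_feats = np.asarray(view_feats, dtype=np.float64)
    scores = np.asarray(view_scores, dtype=np.float64)
    # Drop uninformative views, then weight the remaining ones by their score.
    weights = np.where(scores >= min_score, scores, 0.0)
    total = weights.sum(axis=0, keepdims=True)  # shape (1, K)
    # If every view of an object was dropped, fall back to a uniform average.
    weights = np.where(total > 0,
                       weights / np.maximum(total, 1e-12),
                       1.0 / scores.shape[0])
    fused = np.einsum("vk,vkc->kc", weights, view_feats)
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    return fused / np.maximum(norms, 1e-12)

def paint_points(object_feats, point_instance_ids):
    """Give each 3D point the fused feature of its instance (-1 = background)."""
    ids = np.asarray(point_instance_ids)
    out = np.zeros((ids.shape[0], object_feats.shape[1]))
    mask = ids >= 0
    out[mask] = object_feats[ids[mask]]
    return out

def ground_query(point_feats, text_feat):
    """Cosine similarity between unit-norm point features and a text embedding."""
    text_feat = np.asarray(text_feat, dtype=np.float64)
    return point_feats @ (text_feat / np.linalg.norm(text_feat))

# Toy example: 3 views, 2 objects, 2-dim features.
feats = np.array([[[1, 0], [0, 1]],
                  [[1, 0], [0, 1]],
                  [[0, 1], [0, 1]]], dtype=float)  # view 3 sees object 0 badly
scores = np.array([[0.9, 0.5], [0.8, 0.5], [0.05, 0.5]])
fused = fuse_object_features(feats, scores)
points = paint_points(fused, [0, 1, -1, 0])
similarity = ground_query(points, np.array([2.0, 0.0]))
```

Because every point of an instance receives the same fused feature, similarity to a text query is constant within an object, which is one way to read the claim about crisper segmentation compared with pixel-level averaging. In this toy example the third view's misleading feature for object 0 is discarded because its score falls below the threshold.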