Abstract:The recent success of implicit neural scene representations has presented a viable new method for how we capture and store 3D scenes. Unlike conventional 3D representations, such as point clouds, which explicitly store scene properties in discrete, localized units, these implicit representations encode a scene in the weights of a neural network which can be queried at any coordinate to produce these same scene properties. Thus far, implicit representations have primarily been optimized to estimate only the appearance and/or 3D geometry information in a scene. We take the next step and demonstrate that an existing implicit representation (SRNs) is actually multi-modal; it can be further leveraged to perform per-point semantic segmentation while retaining its ability to represent appearance and geometry. To achieve this multi-modal behavior, we utilize a semi-supervised learning strategy atop the existing pre-trained scene representation. Our method is simple, general, and only requires a few tens of labeled 2D segmentation masks in order to achieve dense 3D semantic segmentation. We explore two novel applications for this semantically aware implicit neural scene representation: 3D novel view and semantic label synthesis given only a single input RGB image or 2D label mask, as well as 3D interpolation of appearance and semantics.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to learn a multi - modal neural scene representation that can represent 3D geometry, appearance and semantic information simultaneously from limited 2D observations. Specifically, the authors propose a method to perform semi - supervised learning on the basis of pre - trained scene representations by using a small number of manually - annotated 2D segmentation masks to achieve dense 3D semantic segmentation. This method can not only synthesize new views and semantic labels from a single 2D image or 2D label mask, but also maintain consistency and continuity between different viewpoints, which is crucial for applications that need to understand 3D scenes, such as robot grasping and autonomous driving. The main contributions of the paper include: - Developing a method that can learn a neural scene representation with semantic and 3D structure awareness. - Through a semi - supervised learning framework, demonstrating how to perform dense 3D semantic segmentation using only a small amount of 2D observation data (such as 30 semantic segmentation masks). - Demonstrating multi - view - consistent rendering results and semantic segmentation masks of 3D point clouds, including the parts of objects that are occluded in the observations. - Performing joint interpolation of geometry, appearance and semantic labels, and demonstrating how to infer the neural scene representation from color images or semantic segmentation masks. These contributions solve the problems of multi - view consistency and 3D scene understanding that traditional 2D methods cannot handle, providing new tools and methods for fields such as computer vision and robotics.

Semantic Implicit Neural Scene Representations With Semi-Supervised Training

SSR-2D: Semantic 3D Scene Reconstruction from 2D Images

Remote-Sensing Image Segmentation Based on Implicit 3-D Scene Representation.

NIS-SLAM: Neural Implicit Semantic RGB-D SLAM for 3D Consistent Scene Understanding

Learning the Implicit Semantic Representation on Graph-Structured Data

Semantic Reconstruction based on RGB Image and Sparse Depth

Self-Supervised Implicit 3D Reconstruction via RGB-D Scans.

Scene-Aware Deep Networks for Semantic Segmentation of Images.

Semantic Scene Mapping with Spatio-temporal Deep Neural Network for Robotic Applications

Semantically-Guided Representation Learning for Self-Supervised Monocular Depth

SNI-SLAM: Semantic Neural Implicit SLAM

Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance

Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

Learning 3D Semantic Segmentation with only 2D Image Supervision

Semantic Scene Completion using Local Deep Implicit Functions on LiDAR Data

CoCoNets: Continuous Contrastive 3D Scene Representations

3D Sketch-aware Semantic Scene Completion Via Semi-supervised Structure Prior

LISNeRF Mapping: LiDAR-based Implicit Mapping via Semantic Neural Fields for Large-Scale 3D Scenes

Joint 2D and 3D Semantic Segmentation with Consistent Instance Semantic