Towards Panoptic 3D Parsing for Single Image in the Wild

Sainan Liu, Vincent Nguyen, Yuan Gao, Subarna Tripathi, Zhuowen Tu
DOI: https://doi.org/10.48550/arXiv.2111.03039
2021-11-30
Abstract: Performing single image holistic understanding and 3D reconstruction is a central task in computer vision. This paper presents an integrated system that performs dense scene labeling, object detection, instance segmentation, depth estimation, 3D shape reconstruction, and 3D layout estimation for indoor and outdoor scenes from a single RGB image. We name our system panoptic 3D parsing (Panoptic3D) in which panoptic segmentation ("stuff" segmentation and "things" detection/segmentation) with 3D reconstruction is performed. We design a stage-wise system, Panoptic3D (stage-wise), where a complete set of annotations is absent. Additionally, we present an end-to-end pipeline, Panoptic3D (end-to-end), trained on a synthetic dataset with a full set of annotations. We show results on both indoor (3D-FRONT) and outdoor (COCO and Cityscapes) scenes. Our proposed panoptic 3D parsing framework points to a promising direction in computer vision. Panoptic3D can be applied to a variety of applications, including autonomous driving, mapping, robotics, design, computer graphics, human-computer interaction, and augmented reality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses panoptic 3D parsing of a single image in the wild: holistic scene understanding combined with 3D reconstruction from one RGB image, a central task in computer vision. Specifically, it proposes an integrated system that performs dense scene labeling, object detection, instance segmentation, depth estimation, 3D shape reconstruction, and 3D layout estimation for indoor and outdoor scenes. The main contributions of the paper are as follows:

1. **A stage-wise system, Panoptic3D (stage-wise),** for natural image datasets such as COCO and Cityscapes, which lack a complete set of segmentation and 3D reconstruction annotations.
2. **An end-to-end pipeline, Panoptic3D (end-to-end),** trained on a dataset with a full set of annotations, the synthetic 3D-FRONT dataset.
3. **Results on both indoor (3D-FRONT) and outdoor (COCO and Cityscapes) scenes,** with potential applications in autonomous driving, mapping, robotics, design, computer graphics, human-computer interaction, and augmented reality.

The system combines existing techniques, including UPSNet for panoptic segmentation, DenseDepth for depth prediction, and GenRe for reconstructing objects of unseen classes, to extract rich 3D information from a single RGB image. It identifies and segments background regions ("stuff") and foreground objects ("things"), reconstructs those objects in 3D, and estimates the 3D layout of the entire scene.
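The stage-wise composition described above can be sketched as a small pipeline that chains independent stages. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Segment`, `SceneParse`, `parse_scene`, and the `toy_*` functions are hypothetical, the toy stages stand in for real models such as UPSNet, DenseDepth, and GenRe without calling them, and the per-segment mean depth is only a crude placeholder for layout estimation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Image = List[List[float]]  # H x W grid, a stand-in for an RGB image or depth map
Mask = List[List[bool]]    # H x W boolean mask


@dataclass
class Segment:
    segment_id: int
    label: str
    is_thing: bool  # True for countable objects ("things"), False for "stuff"
    mask: Mask


@dataclass
class SceneParse:
    segments: List[Segment]
    depth: Image
    shapes: Dict[int, str]    # thing segment_id -> placeholder shape description
    layout: Dict[int, float]  # stuff segment_id -> mean depth (crude layout cue)


def mean_depth(depth: Image, mask: Mask) -> float:
    """Average the depth values that fall inside the mask."""
    values = [d for d_row, m_row in zip(depth, mask)
              for d, m in zip(d_row, m_row) if m]
    return sum(values) / len(values) if values else 0.0


def parse_scene(
    image: Image,
    segment_fn: Callable[[Image], List[Segment]],
    depth_fn: Callable[[Image], Image],
    shape_fn: Callable[[Image, Image, Segment], str],
) -> SceneParse:
    """Run the stages in sequence and merge their outputs (hypothetical API)."""
    segments = segment_fn(image)  # stage 1: panoptic segmentation
    depth = depth_fn(image)       # stage 2: dense depth estimation
    shapes: Dict[int, str] = {}
    layout: Dict[int, float] = {}
    for seg in segments:
        if seg.is_thing:
            # stage 3: per-object 3D shape reconstruction
            shapes[seg.segment_id] = shape_fn(image, depth, seg)
        else:
            # stage 4: layout cue for background regions (assumption: mean depth)
            layout[seg.segment_id] = mean_depth(depth, seg.mask)
    return SceneParse(segments, depth, shapes, layout)


# Toy stand-ins so the sketch runs without any trained model.
def toy_segment(image: Image) -> List[Segment]:
    h, w = len(image), len(image[0])
    floor = [[r >= h // 2 for _ in range(w)] for r in range(h)]
    chair = [[r < h // 2 and c < w // 2 for c in range(w)] for r in range(h)]
    return [Segment(1, "floor", False, floor), Segment(2, "chair", True, chair)]


def toy_depth(image: Image) -> Image:
    # Depth grows with the row index; purely illustrative.
    return [[float(r + 1) for _ in row] for r, row in enumerate(image)]


def toy_shape(image: Image, depth: Image, seg: Segment) -> str:
    return f"mesh<{seg.label}@{mean_depth(depth, seg.mask):.1f}>"
```

Calling `parse_scene(image, toy_segment, toy_depth, toy_shape)` on a small grid returns one `SceneParse` holding the segments, the depth map, a shape entry per "thing", and a layout entry per "stuff" region. Passing the stages in as callables mirrors the stage-wise idea that each component can be trained or swapped separately when no dataset provides all annotations at once.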