Cooperative Holistic 3D Scene Understanding from a Single RGB Image

Siyuan Huang, Siyuan Qi, Yinxue Xiao, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
2018-01-01
Abstract: Holistic 3D indoor scene understanding involves jointly recovering the room layout, camera pose, and object bounding boxes, all in 3D. Most current methods are either inefficient or tackle only part of the problem. In this paper, we propose an end-to-end model that solves all three tasks simultaneously and in real time, given a single RGB image. The key idea is to improve the prediction by i) parametrizing the targets (e.g., 3D boxes) instead of estimating them directly, and ii) training the different modules cooperatively. Specifically, we parametrize the 3D object bounding boxes by the predictions from several modules, i.e., 3D camera pose, depth, and object poses and sizes. The proposed method offers three major advantages. i) The parametrization helps maintain consistency between the 2D image and the 3D world. ii) It largely reduces the prediction variance of the 3D coordinates. iii) Constraints can be imposed on the parametrizations to train different modules simultaneously. We call these constraints cooperative losses. In this paper, we employ three cooperative losses, for 3D bounding boxes, 2D projection, and physical constraints, to estimate a geometrically consistent and physically plausible 3D scene. Experiments on the SUN RGB-D dataset show that our method significantly outperforms prior approaches on 3D layout estimation, 3D object detection, and holistic scene understanding.
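The parametrization idea in the abstract can be illustrated with a small geometric sketch. It is not the authors' implementation: it assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a camera pose reduced to a single pitch angle, and a box heading about the vertical axis. All function names (`box_corners`, `project`, `bbox_2d`) are hypothetical. A 3D box is assembled from a 2D center, a depth, a size, and a heading rather than regressed directly, and projecting it back to the image gives the 2D box that a projection-consistency loss could compare against a 2D detection.

```python
import math

def rot_x(a):
    """Rotation about the x-axis (assumed stand-in for camera pitch)."""
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(a):
    """Rotation about the y-axis (assumed vertical axis, object heading)."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def box_corners(center_2d, depth, size, heading, pitch, K):
    """Build a 3D box from a 2D center, depth, size, heading, and camera pitch."""
    fx, fy, cx, cy = K
    u, v = center_2d
    # Back-project the 2D center to a 3D point in the camera frame.
    p_cam = [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]
    # Move the center into the world frame using the camera rotation.
    center = matvec(rot_x(pitch), p_cam)
    w, h, l = size
    r_obj = rot_y(heading)
    corners = []
    for sx in (-1, 1):
        for sy in (-1, 1):
            for sz in (-1, 1):
                offset = matvec(r_obj, [sx * w / 2, sy * h / 2, sz * l / 2])
                corners.append([center[i] + offset[i] for i in range(3)])
    return center, corners

def project(points, pitch, K):
    """Project world-frame points back to pixel coordinates."""
    fx, fy, cx, cy = K
    world_to_cam = transpose(rot_x(pitch))
    pixels = []
    for p in points:
        x, y, z = matvec(world_to_cam, p)
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels

def bbox_2d(pixels):
    """Axis-aligned 2D box enclosing the projected corners."""
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    return min(us), min(vs), max(us), max(vs)

# Example with made-up values: intrinsics, 2D center, depth, size, angles.
K = (500.0, 500.0, 320.0, 240.0)
center, corners = box_corners((350.0, 260.0), 3.0, (1.0, 0.8, 0.6), 0.3, 0.1, K)
projected_box = bbox_2d(project(corners, 0.1, K))
```

Because the 3D center is derived from the 2D center and the depth, projecting it back recovers the original pixel by construction, which is one way to read the claim that the parametrization keeps the 2D image and the 3D world consistent.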