3DFS: Deformable Dense Depth Fusion and Segmentation for Object Reconstruction from a Handheld Camera

Tanmay Gupta, Daeyun Shin, Naren Sivagnanadasan, Derek Hoiem
DOI: https://doi.org/10.48550/arXiv.1606.05002
2016-07-28
Abstract: We propose an approach for 3D reconstruction and segmentation of a single object placed on a flat surface from an input video. Our approach is to perform dense depth map estimation for multiple views using a proposed objective function that preserves detail. The resulting depth maps are then fused using a proposed implicit surface function that is robust to estimation error, producing a smooth surface reconstruction of the entire scene. Finally, the object is segmented from the remaining scene using a proposed 2D-3D segmentation that incorporates image and depth cues with priors and regularization over the 3D volume and 2D segmentations. We evaluate 3D reconstructions qualitatively on our Object-Videos dataset, comparing to fusion, multiview stereo, and segmentation baselines. We also quantitatively evaluate the dense depth estimation using the RGBD Scenes V2 dataset [Henry et al. 2013] and the segmentation using keyframe annotations of the Object-Videos dataset.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses 3D reconstruction and segmentation of a single object from video captured with a handheld camera. Specifically, it proposes a method that takes an input video of a single object placed on a flat surface and produces a segmented 3D reconstruction. The method estimates dense depth maps for multiple views in a way that preserves detail, then fuses these depth maps using an implicit surface function that is robust to estimation errors, generating a smooth surface reconstruction of the entire scene. The object is then segmented from the remaining scene by combining image and depth cues with priors and regularization over the 3D volume and the 2D segmentations.

### Main contributions of the paper:

1. **Fully automatic 3D model generation**: A fully automatic method is proposed that generates a 3D mesh of a single object placed on a flat surface.
2. **Robust depth surface estimation**: The depth surface is estimated robustly from multi-view stereo cues and, optionally, a sparse point cloud, regularized by a rotation-invariant bending energy.
3. **Improved volume fusion method**: The volumetric fusion of depth maps is redefined. Using soft-max instead of truncation and a weighted average to calculate the TSDF improves robustness to depth map errors and produces a smoother surface.
4. **Joint 2D-3D segmentation**: 2D image segmentation and volumetric 3D reconstruction are jointly modeled as a discrete label assignment problem. Each pixel and each voxel is assigned a label of "object" or "background", and the problem is solved in a graph-cut optimization framework.
5. **System evaluation**: The performance of the entire system is evaluated by pixel segmentation accuracy on the Object-Videos dataset, which contains 12 object videos recorded with mobile phone cameras and provides ground-truth segmentation masks for selected frames.

### Key problems solved:

- **Accuracy of depth map estimation**: Existing methods have difficulty consistently generating high-quality 3D models of objects with specular reflections and irregular shapes. This paper improves the accuracy of depth maps by formulating depth estimation as an optimization problem that incorporates multi-view stereo cues.
- **Robustness of volume fusion**: Traditional TSDF methods rely on accurate depth maps and are sensitive to local errors. This paper improves robustness to depth map errors through a soft-max operation and zero-crossing correction.
- **Accuracy of object segmentation**: Joint 2D-3D segmentation, which combines constraints on pixels and voxels, improves the accuracy with which the object is separated from the background.

In summary, this paper improves the accuracy and robustness of reconstructing a 3D model of a single object from handheld camera video by improving depth map estimation, volume fusion, and segmentation.
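
The difference between truncated weighted-average TSDF fusion and a soft-max style fusion can be illustrated for a single voxel. The sketch below is only one possible reading of the idea, not the paper's formulation, which this summary does not give. It assumes each view contributes a signed distance for the voxel (positive in front of the observed surface), and it uses a Boltzmann (softmax-weighted) average as the soft-max operator. The function names `truncated_tsdf` and `softmax_blend` and the parameters `mu` and `beta` are hypothetical.

```python
import math


def truncated_tsdf(distances, weights, mu):
    """Baseline fusion: clamp each signed distance to [-mu, mu],
    then take the weighted average over views."""
    clamped = [max(-mu, min(mu, d)) for d in distances]
    return sum(w * d for w, d in zip(weights, clamped)) / sum(weights)


def softmax_blend(distances, beta):
    """Assumed soft-max fusion: average the signed distances with
    weights proportional to exp(beta * d).

    beta = 0 gives the plain mean; a large beta approaches max(distances).
    """
    shift = max(beta * d for d in distances)  # subtract the max for numerical stability
    exps = [math.exp(beta * d - shift) for d in distances]
    return sum(e * d for e, d in zip(exps, distances)) / sum(exps)


if __name__ == "__main__":
    # Two views agree that the voxel lies slightly in front of the surface;
    # a third view has a gross depth error and places it far behind.
    distances = [0.02, 0.03, -0.40]
    print("truncated average:", truncated_tsdf(distances, [1.0, 1.0, 1.0], mu=0.05))
    print("soft-max blend:   ", softmax_blend(distances, beta=50.0))
```

In this toy case the clamped outlier still drags the truncated average to roughly zero, which would place a spurious surface at the voxel, while the soft-max blend stays positive because the outlier receives a negligible weight. Whether the paper applies the soft-max in this direction, or to these exact quantities, is an assumption of the sketch.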