Abstract: We propose an approach for 3D reconstruction and segmentation of a single object placed on a flat surface from an input video. Our approach is to perform dense depth map estimation for multiple views using a proposed objective function that preserves detail. The resulting depth maps are then fused using a proposed implicit surface function that is robust to estimation error, producing a smooth surface reconstruction of the entire scene. Finally, the object is segmented from the remaining scene using a proposed 2D-3D segmentation that incorporates image and depth cues with priors and regularization over the 3D volume and 2D segmentations. We evaluate 3D reconstructions qualitatively on our Object-Videos dataset, comparing to fusion, multiview stereo, and segmentation baselines. We also quantitatively evaluate the dense depth estimation using the RGBD Scenes V2 dataset [Henry et al. 2013] and the segmentation using keyframe annotations of the Object-Videos dataset.

What problem does this paper attempt to address?

This paper addresses 3D reconstruction and segmentation of a single object from video captured with a hand-held camera. Specifically, it proposes a method that takes an input video of a single object placed on a flat surface and produces both a 3D reconstruction and a segmentation. The method estimates dense depth maps for multiple views with an objective that preserves detail, then fuses these depth maps using an implicit surface function that is robust to estimation errors, producing a smooth surface reconstruction of the entire scene. The object is then segmented from the rest of the scene by combining image and depth cues with priors and regularization over the 3D volume and the 2D segmentations.

### Main contributions of the paper:

1. **Fully automatic 3D model generation**: A fully automatic method generates a 3D mesh of a single object placed on a flat surface.
2. **Robust depth surface estimation**: The depth surface is estimated robustly from multi-view stereo cues and, optionally, a sparse point cloud, and is regularized by a rotation-invariant bending energy.
3. **Improved volume fusion method**: Volumetric fusion of depth maps is reformulated. Computing the TSDF (truncated signed distance function) with a soft-max, instead of truncation and weighted averaging, improves robustness to depth map errors and yields a smoother surface.
4. **Joint 2D-3D segmentation**: 2D image segmentation and volumetric 3D reconstruction are jointly modeled as a discrete label assignment problem. Each pixel and each voxel is assigned a label of "object" or "background", and the assignment is solved within a graph-cut optimization framework.
5. **System evaluation**: The performance of the entire system is evaluated by pixel segmentation accuracy on the Object-Videos dataset, which contains 12 object videos recorded with mobile phone cameras and provides ground-truth segmentation masks for selected frames.

### Key problems solved:

- **Accuracy of depth map estimation**: Existing methods have difficulty consistently generating high-quality 3D models for objects with specular reflections and irregular shapes. This paper improves depth map accuracy by formulating depth estimation as an optimization problem and introducing multi-view stereo cues.
- **Robustness of volume fusion**: Traditional TSDF methods rely on accurate depth maps and are sensitive to local errors. This paper improves robustness to depth map errors through a soft-max operation and zero-crossing correction.
- **Accuracy of object segmentation**: Joint 2D-3D segmentation combines constraints on pixels and voxels, which improves the accuracy of separating the object from the background.

In summary, this paper improves the accuracy and robustness of reconstructing a 3D model of a single object from hand-held camera video by improving depth map estimation, volume fusion, and segmentation.
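The summary does not give the paper's fusion formula, so the following is only a minimal sketch of the general idea behind the volume fusion point: it contrasts a conventional truncated, weighted-average signed distance with a softmax-weighted average for a single voxel. The function names, the weighting by `-|d| / temperature`, and the sample values are all assumptions made for illustration and are not taken from the paper.

```python
import math

def truncated_average(distances, weights, trunc):
    """Conventional TSDF-style value: clamp each signed distance, then take a weighted mean."""
    clamped = [max(-trunc, min(trunc, d)) for d in distances]
    return sum(w * d for w, d in zip(weights, clamped)) / sum(weights)

def softmax_fusion(distances, temperature):
    """Softmax-weighted mean (illustrative assumption, not the paper's formula).

    Each observation is scored by -|d| / temperature, so observations that
    place the voxel near a surface dominate and distant outliers fade out.
    """
    scores = [-abs(d) / temperature for d in distances]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return sum(e * d for e, d in zip(exps, distances)) / sum(exps)

# Hypothetical signed distances (in meters) seen for one voxel from four
# depth maps; the last value mimics a depth estimation error.
observations = [0.010, 0.012, 0.009, 0.500]

hard = truncated_average(observations, [1.0] * len(observations), trunc=0.05)
soft = softmax_fusion(observations, temperature=0.01)

print(f"truncated average: {hard:.5f}")
print(f"softmax fusion:    {soft:.5f}")
```

In this toy case the outlier still pulls the truncated average toward the truncation bound, while the softmax-weighted value stays close to the three consistent observations. The temperature controls how sharply outliers are suppressed and would need tuning in any real system.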

3DFS: Deformable Dense Depth Fusion and Segmentation for Object Reconstruction from a Handheld Camera

In-Hand 3D Object Reconstruction from a Monocular RGB Video

ObjectFusion: an Object Detection and Segmentation Framework with RGB-D SLAM and Convolutional Neural Networks

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Robust 3D Reconstruction with an RGB-D Camera

3d Reconstruction Of Dynamic Scenes With Multiple Handheld Cameras

Online Global Non-rigid Registration for 3D Object Reconstruction Using Consumer-level Depth Cameras

Chunkfusion: A Learning-Based RGB-D 3D Reconstruction Framework Via Chunk-Wise Integration

Mobile3DScanner: an Online 3D Scanner for High-quality Object Reconstruction with a Mobile Device

Spatio-Temporal Depth Recovery of Dynamic Scenes with Multiple Handheld Cameras

3DFusion, A real-time 3D object reconstruction pipeline based on streamed instance segmented data

Hybrid-MVS: Robust Multi-View Reconstruction with Hybrid Optimization of Visual and Depth Cues

BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects

Saliency-aware Real-time Volumetric Fusion for Object Reconstruction

Object Modelling with a Handheld RGB-D Camera

Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips

Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything

InstanceFusion: Real‐time Instance‐level 3D Reconstruction Using a Single RGBD Camera

Real-time 3D Scene Reconstruction with Dynamically Moving Object Using a Single Depth Camera

Learning Hand-Held Object Reconstruction from In-The-Wild Videos