Abstract: Real-time reconstruction of a 3D environment annotated with semantic information is important for a variety of applications, such as obstacle detection, traffic scene comprehension and autonomous navigation. Current approaches mainly rely on stereo vision, Structure from Motion (SfM) or mobile LiDAR sensors. Each has its own limitation: stereo vision has a high computational cost, SfM needs accurate calibration between a sequence of images, and an onboard LiDAR sensor provides only sparse points without color information. This paper describes a novel method for traffic scene semantic segmentation that combines a sparse LiDAR point cloud (e.g. from Velodyne scans) with a monocular color image. The key novelty of the method is the semantic coupling of the stereoscopic point cloud with the color lattice from a camera image labelled through a Convolutional Neural Network (CNN). The presented method comprises three main processes: (I) perform semantic segmentation on the color image from a monocular camera using a CNN, (II) extract ideal surfaces and other structural information from the point cloud, (III) improve the image segmentation with the extracted structures and label the point cloud with the image segments. The whole process is done in a single frame, and the output of the system is a labelled point cloud which can be used in the construction of semantic object convex and alignment between frames. We demonstrate the effectiveness of our system on the KITTI dataset, which provides sufficient camera and LiDAR data, and present qualitative and quantitative results indicating improvements in segmentation compared to methods that use only image or only LiDAR data.
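
The last part of step (III), labelling the point cloud with the image segments, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the points are already expressed in the camera frame, that a 3x4 projection matrix `P` is available from calibration, and that a per-pixel class map `label_image` has been produced by some segmentation network. The function name, parameter names, and the `unlabeled` sentinel are hypothetical.

```python
import numpy as np

def label_points_from_image(points_cam, P, label_image, unlabeled=-1):
    """Give each 3D point the class of the pixel it projects onto.

    points_cam:  (N, 3) points in the camera frame (assumption).
    P:           (3, 4) camera projection matrix (assumption).
    label_image: (H, W) integer class map from an image segmenter.
    Points behind the camera or outside the image keep `unlabeled`.
    """
    n = points_cam.shape[0]
    # Homogeneous coordinates, then perspective projection.
    homo = np.hstack([points_cam, np.ones((n, 1))])
    uvw = homo @ P.T

    labels = np.full(n, unlabeled, dtype=np.int64)
    in_front = uvw[:, 2] > 1e-6  # discard points behind the image plane

    u = np.full(n, -1, dtype=np.int64)
    v = np.full(n, -1, dtype=np.int64)
    u[in_front] = np.floor(uvw[in_front, 0] / uvw[in_front, 2]).astype(np.int64)
    v[in_front] = np.floor(uvw[in_front, 1] / uvw[in_front, 2]).astype(np.int64)

    h, w = label_image.shape
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Row index is v (vertical), column index is u (horizontal).
    labels[valid] = label_image[v[valid], u[valid]]
    return labels
```

With real sensor data, a LiDAR-to-camera rigid transform would have to be applied to the raw scan before calling such a function. The sketch also leaves out steps (I) and (II) and the refinement of the image segmentation with the extracted structures.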

Learning to Synthesize 3D Indoor Scenes from Monocular Images.

Learning 3D Scene Synthesis from Annotated RGB-D Images

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Binocular Depth Estimation Using Convolutional Neural Network With Siamese Branches.

Up-to-Down Network: Fusing Multi-Scale Context for 3D Semantic Scene Completion

Deeper into Self-Supervised Monocular Indoor Depth Estimation

Learning to Reconstruct and Understand Indoor Scenes from Sparse Views

Indoor Scene Reconstruction From Monocular Video Combining Contextual and Geometric Priors

Depth Estimation from Monocular Images Using Dilated Convolution and Uncertainty Learning.

Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild.

Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

Indoor Scene Generation from a Collection of Semantic-Segmented Depth Images

Learning Monocular Regression of 3D People in Crowds via Scene-aware Blending and De-occlusion

3D-to-2D Distillation for Indoor Scene Parsing

Learning Domain Invariant Features for Unsupervised Indoor Depth Estimation Adaptation

MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Indoor Scene Classification by Incorporating Predicted Depth Descriptor.

Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference

Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks

Depth Is All You Need for Monocular 3D Detection

3D Scene Reconstruction with Sparse LiDAR Data and Monocular Image in Single Frame