One Point, One Object: Simultaneous 3D Object Segmentation and 6-DOF Pose Estimation

Hongsen Liu
2024-06-06
Abstract: We propose a single-shot method for simultaneous 3D object segmentation and 6-DOF pose estimation in pure 3D point cloud scenes, based on the consensus that "one point belongs to only one object", i.e., each point has the potential to predict the 6-DOF pose of its corresponding object. Unlike recently proposed methods for similar tasks, which rely on 2D detectors to predict the projections of the corners of the 3D bounding boxes and must then estimate the 6-DOF pose with a PnP-like spatial transformation method, ours is concise enough not to require additional spatial transformation between different dimensions. Because training data is lacking for many objects, recently proposed 2D detection methods generate training data with a rendering engine and achieve good results. However, rendering in 3D space together with 6-DOF poses is relatively difficult. Therefore, we propose an augmented reality technique to generate the training data in semi-virtual reality 3D space. The key component of our method is a multi-task CNN architecture that simultaneously predicts the 3D object segmentation and the 6-DOF pose in pure 3D point clouds. For experimental evaluation, we generate expanded training data for two state-of-the-art 3D object datasets \cite{PLCHF}\cite{TLINEMOD} using augmented reality (AR) technology. We evaluate our proposed method on the two datasets. The results show that our method generalizes well to multiple scenarios and provides performance comparable to or better than the state of the art.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses simultaneous 3D object segmentation and 6 degrees of freedom (6-DOF) pose estimation in 3D space. Specifically, it proposes an efficient method that performs both tasks at once in pure 3D point cloud scenes. The method rests on a simple consensus: each point belongs to only one object, so each point has the potential to predict the 6-DOF pose of its corresponding object.

The main contributions of the paper are as follows:

1. **Efficient Single-Pass Method**: A concise method is proposed that performs point-level predictions directly on 3D point clouds, without converting the irregular point clouds into regular 3D voxel grids or processing them in separate stages. The core of the method is a multi-task segmentation and prediction network that simultaneously predicts:
   - Point-level semantic segmentation, to filter out background points and reduce the search space;
   - 3D positions of the vertices of the object's 3D bounding box, for estimating the 6-DOF pose transformation;
   - Confidence scores, to evaluate the accuracy of the 3D bounding box predictions.
2. **Augmented Reality Technology for Dataset Generation**: A dataset generation method based on augmented reality (AR) technology is designed. It can quickly create 3D object recognition datasets for fixed work scenes, and it is used to generate extended training data for two existing 3D object recognition datasets.
3. **Experimental Validation**: The proposed method is validated through experiments on two public datasets (LC-HF and LineMod). The results show that the method generalizes well to various scenarios and that its performance is comparable to or better than existing state-of-the-art methods.

In summary, this research proposes a new method that operates directly on 3D point clouds, achieving 3D object segmentation and 6-DOF pose estimation without complex post-processing steps. Additionally, augmented reality technology is used to generate extra training data to further improve the method's performance.
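To make the three network outputs more concrete, the following is a minimal sketch of how per-point predictions might be turned into one pose per object. It is an illustration under stated assumptions, not the paper's implementation: the tuple layout `(label, confidence, corners)`, the function names `best_box_per_object` and `pose_from_corners`, the choice of keeping only the highest-confidence point per object, and the corner ordering (corners 1 and 2 are the neighbours of corner 0 along the box's x and y axes) are all hypothetical. The pose is read off two box edges and the corner centroid, which works only for a box centred on the object origin.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def best_box_per_object(predictions, background=0):
    """Drop background points, then keep the most confident box per object label.

    predictions: iterable of (label, confidence, corners), one entry per point
    (hypothetical layout). corners is a list of eight (x, y, z) tuples.
    """
    best = {}
    for label, confidence, corners in predictions:
        if label == background:
            continue  # segmentation filters out background points
        if label not in best or confidence > best[label][0]:
            best[label] = (confidence, corners)
    return {label: corners for label, (_, corners) in best.items()}

def pose_from_corners(corners):
    """Recover (rotation, translation) from eight predicted box corners.

    Assumes corner 1 and corner 2 are the neighbours of corner 0 along the
    box's x and y axes, and that the box is centred on the object origin.
    """
    x_axis = normalize(sub(corners[1], corners[0]))
    y_raw = sub(corners[2], corners[0])
    # Gram-Schmidt step: remove the component of y_raw that lies along x_axis.
    y_axis = normalize(sub(y_raw, tuple(dot(y_raw, x_axis) * c for c in x_axis)))
    z_axis = cross(x_axis, y_axis)
    # The columns of the rotation matrix are the box axes in the scene frame.
    rotation = [[x_axis[r], y_axis[r], z_axis[r]] for r in range(3)]
    translation = tuple(sum(c[k] for c in corners) / len(corners) for k in range(3))
    return rotation, translation

# Toy example: a 2 x 4 x 6 box rotated 90 degrees about z, then shifted by (1, 2, 3).
size = (2.0, 4.0, 6.0)
model = [tuple(s * ((i >> k) & 1) - s / 2 for k, s in enumerate(size)) for i in range(8)]
scene = [(-y + 1.0, x + 2.0, z + 3.0) for x, y, z in model]

predictions = [
    (0, 0.99, model),  # background point: ignored
    (1, 0.40, model),  # low-confidence prediction for object 1
    (1, 0.95, scene),  # high-confidence prediction for object 1
]
boxes = best_box_per_object(predictions)
rotation, translation = pose_from_corners(boxes[1])
```

Because the corners are predicted directly in 3D, no 2D-to-3D PnP step appears anywhere in this sketch. With noisy predictions, a least-squares rigid fit over all eight corners (for example the Kabsch algorithm) would likely be more robust than using only two edges.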