Robot Pose Estimation Method Based on Image and Point Cloud Fusion with Dynamic Feature Elimination

Zhang Lei, Xu Xiaobin, Cao Chenfei, He Jia, Ran Yngying, Tan Zhiying, Luo Minzhou
DOI: https://doi.org/10.3788/cjl202249.0610001
2022-01-01
Abstract: Objective Robot positioning is an important component of both robot navigation and SLAM technology. During robot positioning, LIDAR and cameras are often used to collect environmental data. Structural or texture features of the environment are computed from these data, and the robot's pose is determined indirectly from them. However, dynamic objects in real environments strongly affect the accuracy of pose estimation: the positions of dynamic features in the global coordinate system change, which corrupts the robot's relative pose estimate. GPS-IMU combined navigation is another popular positioning method, but when the robot operates in the field, in tunnels, underground, or in similar environments, GPS signals are often blocked, resulting in signal loss. Positioning with a single LIDAR or vision sensor is likewise limited by the specific use environment. For example, a camera cannot be used in low-light conditions. Therefore, multisensor fusion positioning has greater application value. This paper proposes a pose estimation algorithm based on deep learning and adaptive pose fusion for LIDAR and camera pose estimation in a dynamic environment. Methods The proposed algorithm estimates pose from LIDAR and a stereo camera in a dynamic environment using deep learning and adaptive pose fusion. YOLOv4 is used to extract candidate frames (bounding boxes) of potential moving objects in the image. Then, the optical flow method is used to track corner points between the preceding and following frames, and dynamic features are eliminated based on the candidate frames. A reprojection error function is constructed from the triangulated map points and the feature points. Nonlinear optimization combined with RANSAC is used to find the best pose. In LIDAR pose estimation, PointRCNN is used to extract candidate frames of potential moving objects from the point cloud.
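The elimination step described above can be illustrated with a minimal sketch. It assumes the simplest possible rule, namely that every tracked corner falling inside a detector's candidate frame is discarded; the abstract does not state the exact criterion, and the paper's method also relies on optical flow, which is not modeled here. The names `in_box` and `remove_dynamic_features`, and the sample coordinates, are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]               # pixel coordinates (x, y)
Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def in_box(pt: Point, box: Box) -> bool:
    """Return True if the point lies inside or on the border of the box."""
    x, y = pt
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def remove_dynamic_features(points: List[Point], boxes: List[Box]) -> List[Point]:
    """Keep only points that fall outside every candidate frame.

    Assumption: any feature inside a candidate frame is treated as dynamic.
    """
    return [p for p in points if not any(in_box(p, b) for b in boxes)]


# Hypothetical data: three tracked corners and one detected candidate frame.
corners = [(10.0, 20.0), (150.0, 80.0), (300.0, 40.0)]
candidate_boxes = [(100.0, 50.0, 200.0, 120.0)]
static_corners = remove_dynamic_features(corners, candidate_boxes)
```

Only the corners in `static_corners` would then contribute to the reprojection error. The same filtering idea would carry over to point clouds with 3D boxes, again as an assumption rather than the authors' stated procedure.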
Meanwhile, the linear and planar feature points in the point cloud are extracted and screened according to the candidate frames. The point-to-line and point-to-plane distances are used to construct an error function from which the pose between the preceding and following frames is calculated. Finally, the two pose estimates are dynamically weighted and fused based on the number of feature points in the image and in the point cloud. Results and Discussions The public KITTI data set and experimental data collected in dynamic scenarios on a platform we built are used to validate the effect of dynamic objects on pose estimation and the effectiveness of the proposed fusion pose estimation algorithm. First, the comparison shows that after excluding the dynamic features, the errors of the six pose components are reduced in most cases. The comprehensive error of the visual pose algorithm in the two scenes is reduced by 0.0300 degrees and 0.0167 m on average. The displacement error of the LIDAR pose algorithm is reduced by 0.0010 m; however, its angle error increases by 0.0016 degrees. The accuracy improvement is also more pronounced for visual pose estimation than for LIDAR pose estimation (Table 1). The fusion result is then compared with the average errors of BA, LOAM, and ORBSLAM2. In sequence 05, the fusion algorithm produces smaller errors than the BA vision algorithm and LOAM: its displacement error is reduced by 0.0105 m and 0.0010 m, respectively. Compared with ORBSLAM2, the displacement errors of our algorithm in scenes 05 and 08 are reduced by 0.0081 m and 0.0026 m (Table 3). Second, this paper constructs an experimental platform comprising two parallel stereo cameras and LIDAR to simulate an indoor dynamic environment. Six pedestrians move within the field of view of the camera and LIDAR.
Experimental results show that after removing dynamic features, the vision and LIDAR pose estimation results outperform algorithms of the same type. The average relative errors of the angle and displacement of the fusion result are 0.0944 degrees and 0.0078 m, an improvement of 0.1918 degrees and 0.0045 m over the LOAM algorithm. Compared with ORBSLAM2, the angular accuracy is improved by 0.0100 degrees (Table 4). Conclusions This paper presents a robot pose estimation algorithm based on the fusion of images and point clouds with dynamic feature elimination. Deep learning is used to extract candidate frames of target objects from the image and the point cloud, and these frames are then used for data processing and feature optimization. This completely avoids the error function abnormality caused by incorrect matching of dynamic features and eliminates its effect on pose estimation. In addition, this paper performs a dynamic weighted fusion of the pose based on the number of feature points. Finally, the public KITTI data set and experimental data collected in dynamic scenarios on the constructed platform are used to compare pose estimation accuracy against three mainstream algorithms: BA, LOAM, and ORBSLAM2. Experiments show that removing dynamic features improves the accuracy of pose estimation to varying degrees, and the pose result after fusion is more stable. Furthermore, the sequential processing logic ensures that, in offline operation, each frame of data is processed correctly regardless of running time.
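The dynamic weighted fusion mentioned in the abstract can be sketched as follows. The abstract says only that the weights depend on the number of feature points; this sketch assumes weights directly proportional to the feature counts and a pose stored as six components (three angles, three displacements) that are blended linearly. Linear blending of angles is reasonable only when the two estimates are close, and the function name `fuse_poses` and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def fuse_poses(pose_cam: Sequence[float],
               pose_lidar: Sequence[float],
               n_img: int,
               n_pc: int) -> List[float]:
    """Blend two 6-component pose estimates, weighted by feature counts.

    Assumption: each weight is proportional to the number of features that
    sensor contributed, so the sensor with more features dominates.
    """
    if len(pose_cam) != len(pose_lidar):
        raise ValueError("pose vectors must have the same length")
    total = n_img + n_pc
    if total <= 0:
        raise ValueError("at least one feature point is required")
    w_cam = n_img / total
    w_lidar = 1.0 - w_cam
    return [w_cam * c + w_lidar * l for c, l in zip(pose_cam, pose_lidar)]


# Hypothetical frame: 300 image features and 100 point cloud features.
pose_cam = [0.010, 0.000, 0.002, 1.00, 0.00, 0.10]
pose_lidar = [0.014, 0.000, 0.002, 0.96, 0.00, 0.10]
fused = fuse_poses(pose_cam, pose_lidar, n_img=300, n_pc=100)
```

With these counts the camera estimate receives weight 0.75 and the LIDAR estimate 0.25. When one sensor finds no usable features, for example a camera in low light, its weight drops to zero and the fused pose falls back to the other sensor.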