Abstract: Direct visual localization with convolutional neural networks has recently attracted attention because it offers an end-to-end pipeline. However, existing methods have two weaknesses. First, they do not exploit 3D information, which limits accuracy, and a single input image is ambiguous in scenes that look similar from different positions. Second, relocalization in variable or dynamic scenes remains challenging. To address these concerns, we propose two multitask relocalization networks, MMLNet and MMLNet+, for estimating the 6-DoF camera pose in static, variable and dynamic scenes. First, to address the lack of datasets for variable scenes, we construct a variable scene dataset using a semiautomatic process that combines SfM and MVS algorithms with a small number of manual labels. Using this process, we collect and generate three scenes: an office, a bedroom and a sitting room. Second, to strengthen the link between 2D images and 3D poses, we design a multitask network, MMLNet, that regresses both the camera pose and the scene point cloud; the Chamfer distance is added to the original pose loss to optimize MMLNet. In addition, MMLNet learns pose trajectory features by applying LSTM layers to an additional pose-array input, which overcomes the limitation of single-image input. Building on MMLNet and targeting dynamic and variable scenes, MMLNet+ adds an auxiliary segmentation branch that distinguishes the fixed, changeable or dynamic parts of the input image. Furthermore, we define a feature fusion block that shares features among the three tasks, further improving performance in dynamic and variable environments. Finally, experiments on static, dynamic and our constructed variable datasets demonstrate state-of-the-art relocalization performance for MMLNet and MMLNet+, and also illustrate the positive effects of the pose learning part, the reconstruction branch and the segmentation task.
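
The idea of adding a Chamfer distance term to a pose loss can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the PoseNet-style pose term (translation error plus a weighted quaternion error), the weights `beta` and `lam`, and the function names `chamfer_distance`, `pose_loss` and `multitask_loss`.

```python
import numpy as np


def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    diff = p[:, None, :] - q[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Mean squared distance to the nearest neighbour, in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


def pose_loss(t_pred, q_pred, t_true, q_true, beta=1.0):
    """Assumed PoseNet-style loss: translation error + beta * rotation error."""
    # Normalise the predicted quaternion before comparing it to the target.
    q_pred = q_pred / np.linalg.norm(q_pred)
    return np.linalg.norm(t_pred - t_true) + beta * np.linalg.norm(q_pred - q_true)


def multitask_loss(t_pred, q_pred, cloud_pred, t_true, q_true, cloud_true,
                   beta=1.0, lam=1.0):
    """Pose loss plus a Chamfer term on the regressed point cloud.

    beta and lam are hypothetical weights, not values from the paper.
    """
    return (pose_loss(t_pred, q_pred, t_true, q_true, beta)
            + lam * chamfer_distance(cloud_pred, cloud_true))
```

With a perfect pose and a perfectly reconstructed point cloud, both terms are zero; any mismatch in the regressed cloud increases the loss even if the pose is exact, which is how a reconstruction branch can supply an additional 3D training signal.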

VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video-Clip Relocalization

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Marker-Less 3d Human Motion Capture With Monocular Image Sequence And Height-Maps

Real-Time 6DOF Pose Relocalization for Event Cameras With Stacked Spatial LSTM Networks

Deep 6-DoF camera relocalization in variable and dynamic scenes by multitask learning

UnLoc: A Unified Framework for Video Localization Tasks

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

Temporal-Aware SfM-Learner: Unsupervised Learning Monocular Depth and Motion from Stereo Video Clips.

Local Optimized and Scalable Frame-to-model SLAM

Deep Camera Pose Regression Using Pseudo-LiDAR

Local Supports Global: Deep Camera Relocalization With Sequence Enhancement

EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization

PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization

Real-time Image-based 6-DOF Localization in Large-Scale Environments

Spatio-Temporal Action Localization in a Weakly Supervised Setting

Modeling Spatio-Temporal Human Track Structure for Action Localization

CyberLoc: Towards Accurate Long-term Visual Localization

WSCLoc: Weakly-Supervised Sparse-View Camera Relocalization

Variational State-Space Models for Localisation and Dense 3D Mapping in 6 DoF

Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization