Progressive Temporal Transformer for Bird’s-Eye-View Camera Pose Estimation

Zhuoyuan Wu, Jiancheng Cai, Ranran Huang, Xinmin Liu, Zhenhua Chai
DOI: https://doi.org/10.1007/978-981-99-8076-5_10
2024-01-01
Abstract: Visual relocalization is a crucial technique used in visual odometry and SLAM to predict the 6-DoF camera pose of a query image. Existing works mainly focus on ground-view images in indoor or outdoor scenes, whereas camera relocalization on unmanned aerial vehicles has received less attention. Frequent view changes and a large depth of field make the aerial setting more challenging. In this work, we establish a Bird’s-Eye-View (BEV) dataset for camera relocalization, a large dataset containing four distinct scenes (roof, farmland, bare ground, and urban area) with challenging properties such as frequent view changes, repetitive or weak textures, and large depths of field. All images in the dataset are associated with a ground-truth camera pose. With 177,242 images, the BEV dataset is a challenging large-scale benchmark for camera relocalization. We also propose a Progressive Temporal transFormer (dubbed PTFormer) as the baseline model. PTFormer is a sequence-based transformer with a progressive temporal aggregation module for exploiting temporal correlation and a parallel absolute and relative prediction head for implicitly modeling the temporal constraint. Thorough experiments are conducted on both the BEV dataset and the widely used handheld datasets 7Scenes and Cambridge Landmarks to demonstrate the robustness of the proposed method.
What problem does this paper attempt to address?