Geometric Consistency-Guaranteed Spatio-Temporal Transformer for Unsupervised Multi-View 3D Pose Estimation

Kaiwen Dong, Kévin Riou, Jingwen Zhu, Andréas Pastor, Kévin Subrin, Yu Zhou, Xiao Yun, Yanjing Sun, Patrick Le Callet
DOI: https://doi.org/10.1109/tim.2024.3440376
IF: 5.6
2024-01-01
IEEE Transactions on Instrumentation and Measurement
Abstract: Unsupervised 3D pose estimation has gained prominence due to the challenges in acquiring labeled 3D data for training. Despite promising progress, unsupervised approaches still lag behind supervised methods in performance. Two factors impede the progress of unsupervised approaches: incomplete geometric constraints and inadequate interaction among spatial, temporal, and multi-view features. This paper introduces an unsupervised pipeline that uses calibrated camera parameters as geometric constraints across views and coordinate spaces to optimize the model by minimizing inconsistencies between the 2D input pose and the re-projection of the predicted 3D pose. This pipeline utilizes the novel Hierarchical Cross Transformer (HCT) to encode higher levels of information by enabling interactions among hierarchical features containing different levels of temporal, spatial, and cross-view information. By minimizing the reliance on human-specific parts, the HCT shows potential for adapting to various pose estimation tasks. To validate this adaptability, we build a connection between human pose estimation and scene pose estimation, introducing the Dynamic-Keypoints-3D (DK-3D) dataset tailored for 3D Scene Pose Estimation in robotic manipulation. Experiments on two 3D human pose estimation datasets demonstrate our method’s new state-of-the-art performance among weakly supervised and unsupervised approaches. The adaptability of our method is confirmed through experiments on DK-3D, setting the initial benchmark for unsupervised 2D-to-3D scene pose lifting.
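The re-projection consistency idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' loss function, which the abstract does not specify; it assumes a standard pinhole camera model with intrinsics K, rotation R, and translation t, and the function names `project` and `reprojection_loss` are hypothetical. It measures the mean pixel distance between each view's 2D input joints and the predicted 3D joints projected back into that view.

```python
import math

def project(point3d, K, R, t):
    """Project a 3D world point to pixel coordinates (assumed pinhole model)."""
    # Camera-frame coordinates: X_cam = R * X_world + t
    cam = [sum(R[i][j] * point3d[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous pixel coordinates: x = K * X_cam
    hom = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    # Perspective division by depth
    return (hom[0] / hom[2], hom[1] / hom[2])

def reprojection_loss(pose3d, poses2d, cameras):
    """Mean Euclidean pixel distance between the 2D input joints of every
    view and the re-projection of the predicted 3D joints into that view."""
    total, count = 0.0, 0
    for pose2d, (K, R, t) in zip(poses2d, cameras):
        for joint3d, joint2d in zip(pose3d, pose2d):
            u, v = project(joint3d, K, R, t)
            total += math.hypot(u - joint2d[0], v - joint2d[1])
            count += 1
    return total / count

# Hypothetical two-camera setup used only for illustration.
K = [[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam1 = (K, I3, [0.0, 0.0, 5.0])
cam2 = (K, I3, [1.0, 0.0, 5.0])

pose3d = [[0.0, 0.0, 0.0], [0.5, -0.2, 0.3]]  # two predicted joints
poses2d = [[project(j, *cam) for j in pose3d] for cam in (cam1, cam2)]

# A 3D pose that is consistent with all 2D inputs yields zero loss.
consistent_loss = reprojection_loss(pose3d, poses2d, [cam1, cam2])
```

Because the loss depends only on the 2D inputs and the calibrated cameras, it needs no 3D labels, which is what makes this kind of constraint usable for unsupervised training.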