EgoFormer: Transformer-Based Motion Context Learning for Ego-Pose Estimation

Tianyi Li, Chi Zhang, Wei Su, Yuehu Liu
DOI: https://doi.org/10.1109/smc53992.2023.10394203
2023-01-01
Abstract: Ego-pose estimation, i.e., predicting the 3D pose of the camera wearer, is essential to AR and VR applications. First-person video is ambiguous in that similar video frames may correspond to entirely different body poses, because parts of the wearer's body are not visible. However, exploiting the context of a video and establishing long-term temporal relationships can alleviate this ambiguity. To this end, this paper proposes EgoFormer, a Transformer-based model that learns motion context from egocentric videos. Moreover, the dynamic features commonly used to characterize first-person video do not provide sufficient temporal information to remove this ambiguity. Therefore, we present a method that effectively extracts temporal features from first-person videos. Results on real-scene and synthetic datasets show that our method can estimate a sequence of human poses with high accuracy and coherence.