Efficient Hierarchical Multi-view Fusion Transformer for 3D Human Pose Estimation

Feng Lu, Lijun Zhang, Xiang-Dong Zhou, Kangkang Zhou, Yu Shi
DOI: https://doi.org/10.1145/3581783.3612098
2023-10-26
Abstract: In multi-view 3D human pose estimation (HPE), information from different viewpoints is highly variable due to complex factors such as background and occlusion, making cross-view feature extraction and fusion difficult. Most existing methods either rely too heavily on camera parameters or extract insufficient semantic features. To address these issues, this paper proposes a hierarchical multi-view fusion transformer (HMVformer) framework for 3D HPE, incorporating cross-view feature fusion methods into the spatial and temporal feature extraction process in a coarse-to-fine manner. First, global-to-local attention graph features are extracted and combined with the original pose features to better preserve the semantic knowledge of the spatial structure. Then, various cross-view feature fusion modules are built and embedded into the pose feature extraction for consistent and distinctive information fusion across multiple viewpoints. Furthermore, sequential temporal information is extracted and fused with spatial knowledge for feature refinement and depth uncertainty reduction. Extensive experiments on three popular 3D HPE benchmarks show that HMVformer achieves state-of-the-art results without relying on complex loss functions or requiring camera parameters, making it a simple but effective approach to mitigating depth ambiguity and improving 3D pose prediction accuracy. Codes and models are available.
Computer Science