Skeleton-Based Spatio-Temporal U-Network for 3D Human Pose Estimation in Video

Weiwei Li,Rong Du,Shudong Chen
DOI: https://doi.org/10.3390/s22072573
IF: 3.9
2022-03-28
Sensors
Abstract:Despite the great progress in 3D pose estimation from videos, there is still a lack of effective means to extract spatio-temporal features of different granularity from complex dynamic skeleton sequences. To tackle this problem, we propose a novel, skeleton-based spatio-temporal U-Net(STUNet) scheme to deal with spatio-temporal features in multiple scales for 3D human pose estimation in video. The proposed STUNet architecture consists of a cascade structure of semantic graph convolution layers and structural temporal dilated convolution layers, progressively extracting and fusing the spatio-temporal semantic features from fine-grained to coarse-grained. This U-shaped network achieves scale compression and feature squeezing by downscaling and upscaling, while abstracting multi-resolution spatio-temporal dependencies through skip connections. Experiments demonstrate that our model effectively captures comprehensive spatio-temporal features in multiple scales and achieves substantial improvements over mainstream methods on real-world datasets.
engineering, electrical & electronic,chemistry, analytical,instruments & instrumentation
What problem does this paper attempt to address?