Abstract: Optical motion capture systems are widely used to obtain human body poses, but two problems remain. The first is joint dislocation, which occurs when joints are too close together. The second is joint loss: under severe self-occlusion, cameras may fail to capture the target joints. To address both problems, we investigate high-level constraints over human poses. We present a Simplified-attention Enhanced Graph Convolutional Network (SaEGC-Net) that flexibly extracts both spatial and temporal features from monocular videos. SaEGC-Net is a U-shaped network for 3D human pose estimation built from Cascaded Spatial-Temporal Graph Convolutional (CST-GC) blocks and Simplified Spatial-Temporal Attention (SST-Att) blocks, which capture long-range dependencies between unconnected joints through graph topologies and an attention mechanism, respectively. Specifically, the CST-GC block embeds two predefined graph structures into a convolutional network, incorporating discriminative features from distant joints. The SST-Att block discards redundant information by sharing part of the attention map, which makes it highly lightweight. It also considers dimension-expanded joint relationships to maintain the diversity of dependencies. To evaluate the effectiveness of our method, we conduct extensive experiments on two datasets: Human3.6M and our own dataset, FDU-Motion. The results demonstrate that our model achieves excellent performance and can handle the two problems above. Ablation studies further show that the network’s submodules better exploit the motion information of the human body.
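The idea of embedding more than one predefined graph structure so that distant joints can exchange features can be illustrated with a small sketch. This is not the authors' CST-GC block: the abstract does not say which two graphs are used, so the choice of a one-hop and a two-hop skeleton graph, the five-joint chain, the absence of learned weights, and all function names below are assumptions made purely for illustration.

```python
# Hypothetical toy skeleton: a chain of five joints, 0-1-2-3-4.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
NUM_JOINTS = 5


def adjacency(edges, n):
    """Symmetric adjacency matrix with self-loops."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    return a


def two_hop(a):
    """Binary matrix marking joints reachable within two hops of each joint."""
    n = len(a)
    return [
        [1.0 if any(a[i][k] and a[k][j] for k in range(n)) else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def row_normalize(a):
    """Scale each row to sum to 1 so aggregation is a neighbor average."""
    return [[v / sum(row) for v in row] for row in a]


def aggregate(a_norm, feats):
    """Average each joint's features over its neighbors in one graph."""
    n, d = len(feats), len(feats[0])
    return [
        [sum(a_norm[i][k] * feats[k][c] for k in range(n)) for c in range(d)]
        for i in range(n)
    ]


def graph_conv(feats, graphs):
    """Average the aggregations over several predefined graphs.

    A real layer would also apply learned weight matrices; they are
    omitted here to keep the sketch minimal.
    """
    outs = [aggregate(row_normalize(g), feats) for g in graphs]
    n, d = len(feats), len(feats[0])
    return [
        [sum(o[i][c] for o in outs) / len(outs) for c in range(d)]
        for i in range(n)
    ]


a1 = adjacency(EDGES, NUM_JOINTS)   # direct bone connections
a2 = two_hop(a1)                    # joints up to two bones apart
feats = [[1.0], [0.0], [0.0], [0.0], [0.0]]  # signal only on joint 0
one_graph = aggregate(row_normalize(a1), feats)
two_graphs = graph_conv(feats, [a1, a2])
```

With only the one-hop graph, joint 2 receives nothing from joint 0 in a single layer; once the two-hop graph is added, it does, which is the sense in which a second predefined graph brings in features from distant joints.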

PoseGTAC: Graph Transformer Encoder-Decoder with Atrous Convolution for 3D Human Pose Estimation

Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

DGFormer: Dynamic Graph Transformer for 3D Human Pose Estimation

SPGformer: Serial-Parallel Hybrid GCN-Transformer with Graph-Oriented Encoder for 2D-to-3D Human Pose Estimation

SPGformer: Serial–Parallel Hybrid GCN-Transformer With Graph-Oriented Encoder for 2-D-to-3-D Human Pose Estimation

GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video

3D Human Pose Estimation with Multi-Scale Graph Convolution and Hierarchical Body Pooling

Hierarchical Graph Networks for 3D Human Pose Estimation

3D Human Pose Estimation Via Graph Extended Spatio-Temporal Convolutional Network

GraphRPE: Relative Position Encoding Graph Transformer for 3D Human Pose Estimation

FMR-GNet: Forward Mix-Hop Spatial-Temporal Residual Graph Network for 3D Pose Estimation

Multi-hop graph transformer network for 3D human pose estimation

Simplified-attention Enhanced Graph Convolutional Network for 3D human pose estimation

MPA-GNet: Multi-Scale Parallel Adaptive Graph Network for 3D Human Pose Estimation

GTFormer: 3D Driver Body Pose Estimation in Video with Graph Convolution Network and Transformer

Optimizing Network Structure for 3D Human Pose Estimation

HPGCN: Hierarchical Poselet-Guided Graph Convolutional Network for 3D Pose Estimation

STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video

Enhanced Spatial–temporal Dynamics in Pose Forecasting Through Multi-Graph Convolution Networks

Multi-Graph Convolution Network for Pose Forecasting

DGT: Dynamic Graph Transformer for Enhanced Processing of Dynamic Joint Sequences in 2D Human Pose Estimation