3D Hand Pose Estimation from a Single RGB Image Through Semantic Decomposition of VAE Latent Space

Guo Xinru, Xu Song, Lin Xiangbo, Sun Yi, Ma Xiaohong
DOI: https://doi.org/10.1007/s10044-021-01048-x
IF: 2.307
2022-01-01
Pattern Analysis and Applications
Abstract: Based on disentangled representation learning theory and the cross-modal variational autoencoder (VAE) model, we derive a “Single Input Multiple Output” (SIMO) disentangled model, cmSIMO-βVAE. Guided by this derived model, we design a new VAE network, named da-VAE, for the challenging task of 3D hand pose estimation from a single RGB image. The da-VAE network has a multi-head encoder with attention modules. Combined with specific supervision, the latent space is decomposed into subspaces with explicit semantics, which correspond to the generative factors of hand pose, shape, appearance, and others. The performance of the proposed da-VAE network is evaluated on the RHD and STB datasets. The experimental results show accuracy competitive with state-of-the-art methods.
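The idea of decomposing a latent space into semantic subspaces can be illustrated with a minimal sketch. This is not the authors' da-VAE implementation: the subspace sizes, the function names, and the value of β below are all hypothetical, and the multi-head encoder, attention modules, and cross-modal decoders are omitted. The sketch only shows how a flat latent vector might be partitioned into named blocks (pose, shape, appearance, other), sampled with the standard reparameterization trick, and regularized with a β-weighted KL term per block, as in a generic β-VAE.

```python
import math
import random

# Hypothetical subspace sizes; the abstract does not state any dimensions.
SUBSPACES = {"pose": 4, "shape": 2, "appearance": 2, "other": 2}


def split_latent(z, layout=SUBSPACES):
    """Partition a flat latent vector into named, contiguous subspaces."""
    if len(z) != sum(layout.values()):
        raise ValueError("latent size does not match layout")
    parts, start = {}, 0
    for name, size in layout.items():
        parts[name] = z[start:start + size]
        start += size
    return parts


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))


def beta_kl_per_subspace(mu, logvar, beta=4.0, layout=SUBSPACES):
    """Beta-weighted KL term, reported separately for each subspace.

    beta=4.0 is an illustrative assumption, not a value from the paper.
    """
    mu_parts = split_latent(mu, layout)
    lv_parts = split_latent(logvar, layout)
    return {name: beta * kl_to_standard_normal(mu_parts[name], lv_parts[name])
            for name in layout}
```

In a setup like the one the abstract describes, each block would additionally receive its own supervision (for example, a 3D-pose decoder reading only the pose block), which is what would give the blocks their explicit semantics; that part is not shown here.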