View-Invariant Skeleton Action Representation Learning via Motion Retargeting
Di Yang, Yaohui Wang, Antitza Dantcheva, Lorenzo Garattoni, Gianpiero Francesca, François Brémond
DOI: https://doi.org/10.1007/s11263-023-01967-8
IF: 13.369
2024-01-18
International Journal of Computer Vision
Abstract: Current self-supervised approaches for skeleton action representation learning often focus on constrained scenarios, where videos and skeleton data are recorded in laboratory settings. When dealing with estimated skeleton data in real-world videos, such methods perform poorly due to the large variations across subjects and camera viewpoints. To address this issue, we introduce ViA, a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning. ViA leverages motion retargeting between different human performers as a pretext task, in order to disentangle the latent action-specific 'Motion' features on top of the visual representation of a 2D or 3D skeleton sequence. Such 'Motion' features are invariant to skeleton geometry and camera view, and allow ViA to facilitate both cross-subject and cross-view action classification tasks. We conduct a study focusing on transfer learning for skeleton-based action recognition with self-supervised pre-training on real-world data (e.g., Posetics). Our results showcase that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy, not only on 3D laboratory datasets such as NTU-RGB+D 60 and NTU-RGB+D 120, but also on real-world datasets where only 2D data are accurately estimated, e.g., Toyota Smarthome, UAV-Human and Penn Action. Code and models will be publicly available at https://walker-a11y.github.io/ViA-project.
computer science, artificial intelligence
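
Illustrative note: the abstract names motion retargeting between performers as the pretext task but does not describe the architecture. The sketch below is therefore not ViA's learned autoencoder. It is a hand-crafted geometric stand-in that shows what "retargeting" means on a 2D skeleton: the bone directions (a crude proxy for 'Motion') are taken from one performer and the bone lengths (a crude proxy for skeleton geometry) from another. The 5-joint layout `PARENTS` and the function names `decompose`, `recompose` and `retarget` are hypothetical, and this toy version makes no attempt at camera-view invariance.

```python
import numpy as np

# Hypothetical 5-joint 2D skeleton: parent index per joint (-1 marks the root).
# Parents are listed before their children so one forward pass suffices.
PARENTS = [-1, 0, 1, 0, 3]

def decompose(seq, parents=PARENTS):
    """Split a (T, J, 2) joint sequence into root path, bone directions, bone lengths."""
    seq = np.asarray(seq, dtype=float)
    num_joints = seq.shape[1]
    root = seq[:, 0].copy()            # (T, 2) trajectory of the root joint
    dirs = np.zeros_like(seq)          # (T, J, 2) unit bone directions per frame
    lengths = np.zeros(num_joints)     # (J,) time-averaged bone lengths
    for j, p in enumerate(parents):
        if p < 0:
            continue                   # the root has no incoming bone
        bones = seq[:, j] - seq[:, p]
        norms = np.linalg.norm(bones, axis=-1)
        lengths[j] = norms.mean()
        dirs[:, j] = bones / np.maximum(norms, 1e-8)[:, None]
    return root, dirs, lengths

def recompose(root, dirs, lengths, parents=PARENTS):
    """Rebuild (T, J, 2) joint positions by walking the kinematic tree from the root."""
    out = np.zeros_like(dirs)
    out[:, 0] = root
    for j, p in enumerate(parents):
        if p >= 0:
            out[:, j] = out[:, p] + lengths[j] * dirs[:, j]
    return out

def retarget(source_seq, target_seq):
    """Play the source performer's motion on the target performer's bone lengths."""
    root, dirs, _ = decompose(source_seq)
    _, _, lengths = decompose(target_seq)
    return recompose(root, dirs, lengths)
```

Calling `retarget(a, b)` on two sequences of shape `(T, 5, 2)` returns a sequence that moves like `a` with the limb proportions of `b`. According to the abstract, ViA learns a comparable separation in latent space rather than computing it by hand as above.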