Open-Vocabulary Skeleton Action Recognition with Diffusion Graph Convolutional Network and Pre-Trained Vision-Language Models

Chao Wei, Zhidong Deng
DOI: https://doi.org/10.1109/icassp48485.2024.10447118
2024-01-01
Abstract: This study explores unsupervised open-vocabulary skeleton action recognition, aiming to address the inaccurate spatial matching and poor interpretability of existing GCN models. We present Skeleton-DGCFA, an approach that performs feature alignment (FA) between the skeleton and image modalities, based on a large pre-trained vision and language (VL) model together with our new diffusion graph convolutional (DGC) skeleton encoder. The DGC encoder comprises spatial and temporal convolutional modules, allowing for the diffusion of different graph semantic features. Skeleton-DGCFA harnesses recent large-scale VL models and extends their zero-shot capabilities to the skeleton modality by capitalizing on its natural pairing with images. The open-vocabulary zero-shot capabilities improve with the strength of both the pre-trained VL model and our DGC skeleton encoder. We establish a new state of the art in zero-shot skeleton action recognition, significantly surpassing the vanilla zero-shot method by 27.0% and 19.7% on NTU-60 and NTU-120, respectively.
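The abstract does not spell out how a prediction is made once skeleton features are aligned with the VL model's embedding space. A minimal sketch follows, assuming the common practice for VL models of picking the class whose text embedding has the highest cosine similarity to the input embedding. The function names, the three-dimensional toy vectors, and the label set are all hypothetical and are not taken from the paper; in practice the embeddings would come from the skeleton encoder and the VL model's text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(skeleton_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the skeleton embedding.

    Assumption: both embeddings already live in the same aligned space.
    """
    return max(
        label_embeddings,
        key=lambda name: cosine(skeleton_embedding, label_embeddings[name]),
    )


# Toy stand-ins for text embeddings of action names (hypothetical values).
label_embeddings = {
    "waving": [1.0, 0.0, 0.0],
    "sitting down": [0.0, 1.0, 0.0],
    "jumping": [0.0, 0.0, 1.0],
}

# Toy stand-in for the output of a skeleton encoder (hypothetical values).
skeleton_embedding = [0.1, 0.2, 0.9]

prediction = zero_shot_predict(skeleton_embedding, label_embeddings)
print(prediction)
```

Because the label set is just a dictionary of text embeddings, new action names can be added at inference time without retraining, which is what makes this style of classification open-vocabulary.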