SkeletonCLIP: Recognizing Skeleton-based Human Actions with Text Prompts

Lin Yuan, Zhen He, Qiang Wang, Leiyang Xu, Xiang Ma
DOI: https://doi.org/10.1109/icsai57119.2022.10005459
2022-01-01
Abstract: Human action recognition has been a hot research topic for decades. Mainstream supervised frameworks combine a feature extraction backbone with a softmax classifier to predict daily human actions. When the set of action classes changes, the classifier must be retrained on top of the already trained backbone. This pipeline restricts the generalization and transfer ability of the model because it requires an extra training stage. Moreover, replacing action labels with plain numeric labels discards useful semantic information, so the resulting classifier carries no semantic meaning. In this work, we present SkeletonCLIP, a model for skeleton-based human action recognition. We add a text encoder to extract semantic information from the labels while keeping the original sequence encoder. In place of the traditional classifier head and cross-entropy loss, we use the dot product to measure the similarity of sequence-text pairs. Experiments on three human action datasets show that our framework reaches a higher recognition accuracy with the help of semantic information when the network is trained from scratch. The code is available at eunseo-v/SkeletonCLIP.
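The idea of scoring sequence-text pairs with a dot product, instead of using a fixed classifier head, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy three-dimensional embeddings, and the L2 normalization step are all hypothetical, and in the actual model the embeddings would come from the trained sequence and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumed here so the dot product acts as cosine similarity)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def classify(seq_embedding, label_embeddings):
    """Score one skeleton-sequence embedding against every label-text embedding.

    Returns the best-matching label and the full score dictionary.
    """
    s = l2_normalize(seq_embedding)
    scores = {
        label: dot(s, l2_normalize(text_vec))
        for label, text_vec in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical text embeddings for three action labels (stand-ins for text-encoder output).
label_embeddings = {
    "drink water": [1.0, 0.0, 0.0],
    "sit down": [0.0, 1.0, 0.0],
    "wave hand": [0.0, 0.0, 1.0],
}

# Hypothetical embedding of one skeleton sequence (stand-in for sequence-encoder output).
seq_embedding = [0.1, 0.9, 0.2]

best_label, scores = classify(seq_embedding, label_embeddings)
print(best_label)  # prints "sit down"
```

In this sketch, supporting a new action class only requires adding one more entry to `label_embeddings`; no classifier weights are tied to a fixed number of classes.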