Abstract: Unsupervised skeleton-based action recognition has made remarkable progress recently. However, existing unsupervised learning methods suffer from a severe overfitting problem, so small networks are used, which significantly reduces representation capability. To address this problem, the overfitting mechanism behind unsupervised learning for skeleton-based action recognition is first investigated. It is observed that the skeleton is already a relatively high-level and low-dimensional feature, but does not lie on the same manifold as the features for action recognition. Simply applying existing unsupervised learning methods may produce features that discriminate between individual samples rather than action classes, resulting in overfitting. To solve this problem, this paper presents an Unsupervised spatial-temporal Feature Enrichment and Fidelity Preservation framework (U-FEFP) to generate rich distributed features that contain all the information of the skeleton sequence. A spatial-temporal feature transformation subnetwork is developed using a spatial-temporal graph convolutional network and a graph convolutional gated recurrent unit network as the basic feature extraction network. Unsupervised learning based on Bootstrap Your Own Latent (BYOL) is used to generate rich distributed features, and unsupervised pretext-task-based learning is used to preserve the information of the skeleton sequence. The two unsupervised learning approaches are combined in U-FEFP to produce robust and discriminative representations. Experimental results on three widely used benchmarks, namely NTU-RGB+D-60, NTU-RGB+D-120 and PKU-MMD, demonstrate that the proposed U-FEFP achieves the best performance compared with state-of-the-art unsupervised learning methods. t-SNE visualizations further validate that U-FEFP can learn more discriminative features for unsupervised skeleton-based action recognition.

FENet: An Efficient Feature Excitation Network for Video-based Human Action Recognition

Encoding Learning Network Combined with Feature Similarity Constraints for Human Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Frequency Enhancement Network for Efficient Compressed Video Action Recognition

ACTION-Net: Multipath Excitation for Action Recognition

TEINet: Towards an Efficient Architecture for Video Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

FEXNet: Foreground Extraction Network for Human Action Recognition

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

FEASE: Feature Selection and Enhancement Networks for Action Recognition

Efficient spatio-temporal network for action recognition

Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition

Fine-Tuned Temporal Dense Sampling with 1D Convolutional Neural Network for Human Action Recognition

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Attentional Fused Temporal Transformation Network for Video Action Recognition

Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks

Human Action Recognition Based on Improved Fusion Attention CNN and RNN

Dense Semantics-Assisted Networks For Video Action Recognition

Energy-Guided Temporal Segmentation Network for Multimodal Human Action Recognition

Unsupervised Spatial-Temporal Feature Enrichment and Fidelity Preservation Network for Skeleton based Action Recognition