Abstract: This work is the first research effort to address unsupervised 3-D action representation learning with point cloud sequences, in contrast to existing unsupervised methods that rely on 3-D skeleton information. Our approach builds on the state-of-the-art 3-D action descriptor, 3-D dynamic voxel (3DV), combined with contrastive learning (CL). 3DV compresses a point cloud sequence into a compact point cloud of 3-D motion information, and spatiotemporal data augmentations are applied to it to drive CL. However, we find that existing CL methods (e.g., SimCLR or MoCo v2) often suffer from high pattern variance across the augmented 3DV samples from the same action instance; that is, the augmented 3DV samples remain highly complementary in their features after CL, and the complementary discriminative clues within them are not well exploited. To address this, we propose a feature augmentation adapted CL (FACL) approach, which improves 3-D action representation by considering the features from all augmented 3DV samples jointly, in the spirit of feature augmentation. FACL runs in a global-local way: one branch learns a global feature that captures the discriminative clues from the raw and augmented 3DV samples, and the other enhances the discriminative power of the local feature learned from each augmented 3DV sample. The global and local features are fused via concatenation to characterize the 3-D action jointly. To fit FACL, a series of spatiotemporal data augmentation approaches on 3DV is also studied. Extensive experiments verify the superiority of our unsupervised method for 3-D action feature learning. It outperforms the state-of-the-art skeleton-based counterparts by 6.4% and 3.6% under the cross-setup and cross-subject test settings on NTU RGB+D 120, respectively. The source code is available at https://github.com/tangent-T/FACL.
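
The two mechanisms named in the abstract, a contrastive objective in the style of SimCLR or MoCo v2 and the fusion of global and local features by concatenation, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names `info_nce` and `fuse_global_local`, the temperature value, and the use of plain Python lists as feature vectors are all assumptions made for illustration, and the InfoNCE loss shown is the generic form used by common CL methods rather than the specific FACL loss.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one query (assumed form, not the FACL loss).

    The positive would be the feature of another augmented view of the same
    action instance; the negatives would come from other instances.
    """
    q = l2_normalize(query)
    keys = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    # Cosine similarity between the query and every key, scaled by temperature.
    logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive key (index 0).
    return log_sum - logits[0]


def fuse_global_local(global_feat, local_feat):
    """Fuse a global and a local feature vector by concatenation."""
    return list(global_feat) + list(local_feat)


if __name__ == "__main__":
    aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
    misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
    print(aligned < misaligned)
    print(len(fuse_global_local([0.1, 0.2, 0.3], [0.4, 0.5])))
```

The loss is small when the query matches its positive and large when it matches a negative instead, which is the behavior a contrastive objective relies on. The fused vector has a dimension equal to the sum of the global and local dimensions.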

You Will Never Walk Alone: One-Shot 3D Action Recognition with Point Cloud Sequence

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton Matching

Online Robust Action Recognition Based on a Hierarchical Model

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Beyond Pattern Variance: Unsupervised 3-D Action Representation Learning With Point Cloud Sequence

3D Action Recognition Using Multi-Temporal Skeleton Visualization

KAN-HyperpointNet for Point Cloud Sequence-Based 3D Human Action Recognition

3DInAction: Understanding Human Actions in 3D Point Clouds

Unveiling the Hidden Realm: Self-supervised Skeleton-based Action Recognition in Occluded Environments

3D Action Recognition Using Data Visualization and Convolutional Neural Networks

Part-aware Prototypical Graph Network for One-shot Skeleton-based Action Recognition

Attention-Based Multilevel Co-Occurrence Graph Convolutional LSTM for 3-D Action Recognition

A Novel 3D Human Action Recognition Framework for Video Content Analysis

A Multi-Task Neural Network for Action Recognition with 3D Key-Points

On the Importance of Spatial Relations for Few-shot Action Recognition

Enhancing Few-Shot Action Recognition Using Skeleton Temporal Alignment and Adversarial Training

Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition

Exploring Self-Supervised Skeleton-Based Human Action Recognition under Occlusions