Abstract: Reducing redundancy is crucial for improving the efficiency of video recognition models. An effective approach is to select informative content from the whole video, which has given rise to a popular family of dynamic video recognition methods. However, existing dynamic methods perform either temporal or spatial selection independently, neglecting the fact that redundancy is usually spatial and temporal at the same time. Moreover, the selected content is usually cropped into fixed shapes (e.g., temporally cropped frames, spatially cropped patches), while the actual distribution of informative content can be much more diverse. Motivated by these two observations, this paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net). From different frames and positions, AK-Net selects informative points scattered in arbitrarily shaped regions as a set of "action keypoints" and then recasts video recognition as point cloud classification. More concretely, AK-Net has two steps: keypoint selection and point cloud classification. First, it feeds the video into a baseline network and takes a feature map from an intermediate layer. Each pixel on this feature map is viewed as a spatial-temporal point, and informative keypoints are selected using self-attention. Second, AK-Net devises a ranking criterion to arrange the keypoints into an ordered 1D sequence. Since the video is represented as a 1D sequence after the specified layer, AK-Net transforms the subsequent layers into a point cloud classification sub-net by compacting the original 2D convolutional kernels into 1D kernels.
Consequently, AK-Net brings a two-fold efficiency benefit: the keypoint selection step collects informative content within arbitrary shapes and makes modeling spatial-temporal dependencies more efficient, while the point cloud classification step further reduces the computational cost by compacting the convolutional kernels. Experimental results show that AK-Net consistently improves the efficiency and performance of baseline methods on several video recognition benchmarks.
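The two steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the function names `select_keypoints` and `conv1d` are hypothetical, each point is scored by the average attention it receives under plain scaled dot-product self-attention, the ranking criterion is taken to be raster order over (frame, row, column), and an ordinary 1D convolution stands in for the compacted kernels.

```python
import numpy as np


def select_keypoints(feat, k):
    """Pick k points from a (T, H, W, C) feature map.

    Assumption: a point's score is the average attention it receives
    under plain scaled dot-product self-attention over all points.
    """
    T, H, W, C = feat.shape
    points = feat.reshape(-1, C)                   # (N, C) with N = T*H*W
    logits = points @ points.T / np.sqrt(C)        # (N, N) pairwise similarity
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)        # row-wise softmax
    scores = attn.mean(axis=0)                     # attention received per point
    top = np.argpartition(-scores, k - 1)[:k]      # indices of the k highest scores
    order = np.sort(top)                           # assumed ranking: raster (t, h, w) order
    return points[order], order


def conv1d(seq, kernel):
    """Valid 1D convolution over an ordered (K, C_in) keypoint sequence.

    kernel has shape (width, C_in, C_out).
    """
    width = kernel.shape[0]
    windows = np.stack([seq[i:i + width] for i in range(len(seq) - width + 1)])
    return np.einsum("nwc,wco->no", windows, kernel)


rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 7, 7, 16))          # toy intermediate feature map
keypoints, idx = select_keypoints(feat, k=32)      # 32 of the 196 points survive
out = conv1d(keypoints, rng.standard_normal((3, 16, 8)))
print(keypoints.shape, out.shape)                  # (32, 16) (30, 8)
```

In this toy setting the layers after the selection step operate on 32 points instead of 4 × 7 × 7 = 196, which is where the saving from keypoint selection would come from; the 1D kernel then touches 3 neighbors per output rather than a 2D window.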

APSNet: Toward Adaptive Point Sampling for Efficient 3D Action Recognition

Adaptive Recurrent Forward Network for Dense Point Cloud Completion

APSNet: Attention Based Point Cloud Sampling

PRENet: A Plane-Fit Redundancy Encoding Point Cloud Sequence Network for Real-Time 3D Action Recognition

KAN-HyperpointNet for Point Cloud Sequence-Based 3D Human Action Recognition

3D-Pruning: A Model Compression Framework for Efficient 3D Action Recognition

GeometryMotion-Net: A Strong Two-stream Baseline for 3D Action Recognition

Real-time 3D human action recognition based on Hyperpoint sequence

Action Keypoint Network for Efficient Video Recognition

3DInAction: Understanding Human Actions in 3D Point Clouds

3DV: 3D Dynamic Voxel for Action Recognition in Depth Video

Hybrid Attentive Prototypical Network for Few-Shot Action Recognition

HOPC: Histogram of Oriented Principal Components of 3D Pointclouds for Action Recognition

Dynamic Sampling Networks for Efficient Action Recognition in Videos

A Pairwise Attentive Adversarial Spatiotemporal Network for Cross-Domain Few-Shot Action Recognition

T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition

Action Recognition Based on A Selective Sampling Strategy for Real-Time Video Surveillance

You Will Never Walk Alone: One-Shot 3D Action Recognition with Point Cloud Sequence

TTPOINT: A Tensorized Point Cloud Network for Lightweight Action Recognition with Event Cameras

EP-Net: Improving Point Cloud Learning Efficiency Through Feature Decoupling