Abstract: Humans and many other animals can detect, recognize, and classify natural actions in a very short time. How the visual system achieves this, and how to make machines understand natural actions, have been the focus of neurobiological studies and computational modeling over the last several decades. A key issue is which spatial-temporal features should be encoded and what characterizes their occurrences in natural actions. Current global encoding schemes depend heavily on segmentation, while local encoding schemes lack descriptive power. Here, we propose natural action structures, i.e., multi-size, multi-scale, spatial-temporal concatenations of local features, as the basic features for representing natural actions. Under this concept, any action is a spatial-temporal concatenation of a set of natural action structures, which together convey a full range of information about natural actions. We took several steps to extract these structures. First, we sampled a large number of sequences of patches at multiple spatial-temporal scales. Second, we performed independent component analysis on the patch sequences and classified the independent components into clusters. Finally, we compiled a large set of natural action structures, each corresponding to a unique combination of the clusters at the selected spatial-temporal scales. To classify human actions, we used a set of informative natural action structures as inputs to two widely used models. We found that the natural action structures obtained here achieved significantly better recognition performance than low-level features and that the performance was better than or comparable to that of the best current models. We also found that classification performance with natural action structures as features was only slightly affected by changes in scale and by artificially added noise. We conclude that the natural action structures proposed here can be used as the basic encoding units of actions and may hold the key to natural action understanding.
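The extraction steps described in the abstract (sampling patch sequences at several spatial-temporal scales, assigning each to a cluster, and treating the combination of cluster labels across scales as one structure) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the scales in `SCALES`, the helper names, and above all `toy_cluster_id`, which replaces the independent component analysis and clustering of the paper with a crude quantization of mean intensity and frame-to-frame change. Pixel values are assumed to lie in [0, 1].

```python
import random
from collections import Counter

# Hypothetical scales: (spatial patch size in pixels, temporal length in frames).
SCALES = [(2, 2), (4, 3)]

def patch_sequence(video, t, y, x, size, length):
    """Return `length` flattened size-by-size patches starting at frame t, row y, column x."""
    return [[video[t + dt][y + dy][x + dx]
             for dy in range(size) for dx in range(size)]
            for dt in range(length)]

def toy_cluster_id(seq, bins=3):
    """Stand-in for ICA plus clustering (NOT the paper's method):
    quantize mean intensity and mean absolute frame-to-frame change."""
    n_pixels = len(seq[0])
    mean = sum(sum(p) for p in seq) / (len(seq) * n_pixels)
    change = sum(abs(a - b) for p, q in zip(seq, seq[1:]) for a, b in zip(p, q))
    change /= (len(seq) - 1) * n_pixels

    def quantize(v):
        # Values are assumed to lie in [0, 1]; clamp the top edge into the last bin.
        return min(int(v * bins), bins - 1)

    return quantize(mean) * bins + quantize(change)

def structure_at(video, t, y, x):
    """One structure = the tuple of cluster labels across all scales at one location."""
    return tuple(toy_cluster_id(patch_sequence(video, t, y, x, size, length))
                 for size, length in SCALES)

def structure_histogram(video, n_samples=200, seed=0):
    """Sample random locations and count how often each structure occurs."""
    rng = random.Random(seed)
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    max_size = max(size for size, _ in SCALES)
    max_length = max(length for _, length in SCALES)
    counts = Counter()
    for _ in range(n_samples):
        t = rng.randrange(frames - max_length + 1)
        y = rng.randrange(height - max_size + 1)
        x = rng.randrange(width - max_size + 1)
        counts[structure_at(video, t, y, x)] += 1
    return counts
```

A histogram of this kind, computed per video, is one plausible way to turn structure occurrences into a feature vector for a downstream classifier; the abstract does not specify how the two classification models consume the structures, so that step is left out.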

Attention with Structure Regularization for Action Recognition

Recurrent Attention Network Using Spatial-Temporal Relations for Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

Select and Focus: Action Recognition with Spatial-Temporal Attention

Online Robust Action Recognition Based on a Hierarchical Model

Where and When to Look? Spatio-temporal Attention for Action Recognition in Videos

Skeleton-based Attention-Aware Spatial-Temporal Model for Action Detection and Recognition

Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection

A Two-Layer Representation For Large-Scale Action Recognition

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Interpretable Spatio-temporal Attention for Video Action Recognition

Nesting Spatiotemporal Attention Networks for Action Recognition

An Improved Attention-Based Spatiotemporal-Stream Model for Action Recognition in Videos

Robust action recognition using multi-scale spatial-temporal concatenations of local features as natural action structures

Structured Attention Composition for Temporal Action Localization

Attribute Regularization Based Human Action Recognition

Action Recognition with Actons

Spatio-Temporal Attention Deep Network for Skeleton Based View-Invariant Human Action Recognition

Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction

SSTA-Net: Self-supervised Spatio-Temporal Attention Network for Action Recognition