Abstract: Local spatio-temporal descriptors and feature encoding are two key steps in human action recognition based on spatio-temporal interest points (STIP). Since the local descriptors for STIP are essentially texture-based motion information, the key to local feature description is extracting invariant, robust and discriminative local texture features and motion information within the reference spatio-temporal volume. The scattering transform is an image transform based on directional wavelet transforms and scale convolution; it offers local translation invariance, rotation invariance and stability to elastic deformation for local texture features. This paper proposes a novel local descriptor for STIP based on a spatio-temporal three-dimensional scattering transform, which extends the original scattering transform to three-dimensional spatio-temporal space. Compared with traditional descriptors such as HOG and HOF, the proposed scattering transform coefficients based histogram of oriented gradients (STC-HOG) descriptor captures more robust and discriminative motion information of local texture for STIP. A feature encoding algorithm is needed to incorporate the local descriptors into an action video representation. To address the loss of feature distribution location information in vector of locally aggregated descriptors (VLAD) encoding, a histogram of distribution vector of locally aggregated descriptors (HOD-VLAD) based on a Gaussian kernel is proposed. We validated the proposed algorithm for human action recognition on multiple publicly available datasets, including KTH, UCF Sports and HMDB51. The experimental results indicate that the proposed descriptor and encoding method improve both the efficiency and the accuracy of human action recognition.
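To make the encoding step concrete, the following is a minimal sketch of baseline VLAD encoding only. The proposed HOD-VLAD is not reproduced here, since the abstract gives no details of it. The function name `vlad_encode`, the hard nearest-codeword assignment, and the global L2 normalization are assumptions chosen for illustration, and the codebook is assumed to be given (for example, from k-means).

```python
import math

def vlad_encode(descriptors, centers):
    """Baseline VLAD sketch: sum each descriptor's residual to its nearest codeword.

    descriptors: list of D-dimensional local descriptors (lists of floats)
    centers:     list of K codewords (assumed precomputed, e.g. by k-means)
    Returns a flat, L2-normalized vector of length K * D.
    """
    k, d = len(centers), len(centers[0])
    residuals = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Hard-assign the descriptor to its nearest codeword (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])),
        )
        # Accumulate the residual x - c for that codeword.
        for i in range(d):
            residuals[nearest][i] += x[i] - centers[nearest][i]
    # Concatenate the per-codeword residual sums and L2-normalize the result.
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Two descriptor sets with different spreads around the same single codeword.
center = [[0.0, 0.0]]
tight = [[1.0, 0.0], [-1.0, 0.0]]
wide = [[3.0, 0.0], [-3.0, 0.0]]
same_encoding = vlad_encode(tight, center) == vlad_encode(wide, center)
```

In this toy case `same_encoding` is true: the summed residuals cancel in both sets, so the encoding does not record how the descriptors are spread around the codeword. This is one simple illustration of information that plain VLAD discards, and it may not be exactly the loss the abstract refers to.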

Action Recognition Using Hierarchical STIP Saliency and Mixed Neighborhood Features

Action Recognition with Stacked Fisher Vectors

Action Recognition Using Discriminative Spatio-Temporal Neighborhood Features

Action Recognition Based on Multi-scale Oriented Neighborhood Features

Human Action Recognition Using Multi-Velocity STIPs and Motion Energy Orientation Histogram

Action recognition using a hierarchy of feature groups

Multiscale Spatial Position Coding under Locality Constraint for Action Recognition

Human Action Recognition Based on Spatio-Temporal Three-Dimensional Scattering Transform Descriptor and an Improved VLAD Feature Encoding Algorithm

Action Detection by Fusing Hierarchically Filtered Motion with Spatiotemporal Interest Point Features

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Action Recognition Via Cumulative Histogram of Multiple Features

Research on Local Spatio-Temporal Features for Action Recognition

Efficient Local Filter Bank with over Complete Spatiotemporal Pooling in Action Recognition

Extracting Hierarchical Spatial and Temporal Features for Human Action Recognition

Action Recognition Based on Spatio-temporal Log-Euclidean Covariance Matrix

Local Spatiotemporal Coding and Sparse Representation Based Human Action Recognition

Action Recognition Using Polyhedron Neighborhood Features

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Action Recognition with Spatial-Temporal Representation Analysis Across Grassmannian Manifold and Euclidean Space

Human Action Recognition Algorithm Based on Spatial Temporal Depth Feature

Action Recognition by Spatio-Temporal Oriented Energies