Abstract: In this paper, we present a unified, holistic representation of human actions based on the spatio-temporal steerable pyramid (STSP). A video sequence is viewed as a spatio-temporal volume that preserves all the appearance and motion information of the action it contains. By decomposing spatio-temporal volumes into band-passed sub-volumes, the spatio-temporal Laplacian pyramid provides an effective technique for multi-scale analysis of video sequences, so that spatio-temporal patterns at different scales can be well localized and captured. To efficiently explore the underlying local spatio-temporal orientation structures at multiple scales, a bank of three-dimensional separable steerable filters is applied to each sub-volume of the Laplacian pyramid. The outputs of each quadrature pair of steerable filters are squared and summed to yield a more robust oriented energy representation. For further invariance and compactness, a spatio-temporal max pooling operation is performed between filter responses at adjacent scales and over spatio-temporal neighbourhoods. To capture the appearance, local geometric structure and motion of an action, we apply the STSP to the intensity, 3D gradients and optical flow of video sequences, yielding a unified holistic representation of human actions. Taking advantage of multi-scale, multi-orientation analysis and feature pooling, STSP produces a compact yet informative and invariant representation of human actions. We conduct extensive experiments on the KTH, UCF Sports and HMDB51 datasets, which show that the unified STSP achieves results comparable to those of state-of-the-art methods.
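
The oriented-energy and pooling steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`oriented_energy`, `max_pool_3d`, `pool_adjacent_scales`), the non-overlapping pooling window, and the assumption that the two adjacent scales have already been resampled to the same shape are all hypothetical choices made for illustration. The steerable filtering itself is not shown; the sketch starts from the even and odd responses of a quadrature pair.

```python
import numpy as np

def oriented_energy(even, odd):
    """Square and sum the two responses of a quadrature filter pair."""
    return even ** 2 + odd ** 2

def max_pool_3d(volume, size=2):
    """Max over non-overlapping size x size x size spatio-temporal blocks.

    Assumption: non-overlapping windows; trailing samples that do not
    fill a whole block are discarded.
    """
    t, h, w = (d // size for d in volume.shape)
    trimmed = volume[:t * size, :h * size, :w * size]
    blocks = trimmed.reshape(t, size, h, size, w, size)
    return blocks.max(axis=(1, 3, 5))

def pool_adjacent_scales(energy_a, energy_b, size=2):
    """Element-wise max across two scales, then local max pooling.

    Assumption: both energy volumes were resampled to the same shape.
    """
    return max_pool_3d(np.maximum(energy_a, energy_b), size)

# A quadrature pair responds to a sinusoid with cos- and sin-phase outputs,
# so the summed squares do not depend on the local phase.
phase = np.linspace(0.0, 2.0 * np.pi, 8)
energy = oriented_energy(3.0 * np.cos(phase), 3.0 * np.sin(phase))
```

In this toy case `energy` is constant regardless of phase, which is the sense in which the squared-and-summed output is more robust than either filter response alone; the max pooling then keeps only the strongest response within each neighbourhood and across the two scales.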

Human Action Retrieval via Spatio-temporal Cuboids

High-Order PCA of Video Volume Tensors for Human Action Representation and Recognition Shu Kong And

Video Action Detection With Relational Dynamic-Poselets

Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition

Human Action Recognition by Using Polyhedron Model-Based Spatio-Temporal Gradient Descriptor

Combining Sparse And Dense Descriptors With Temporal Semantic Structures For Robust Human Action Recognition

Local Spatiotemporal Coding and Sparse Representation Based Human Action Recognition

Spatio-temporal Laplacian Pyramid Coding for Action Recognition.

Action Recognition Based on Spatial-Temporal Pyramid Sparse Coding.

Mining Spatiotemporal Video Patterns Towards Robust Action Retrieval

Actor-independent Action Search Using Spatiotemporal Vocabulary with Appearance Hashing.

Action Recognition by Spatio-Temporal Oriented Energies

Sparse Coding-Based Spatiotemporal Saliency for Action Recognition

Spatio-temporal Semantic Features for Human Action Recognition.

Human Action Recognition Based on Oriented Gradient Histogram of Slide Blocks on Spatio-Temporal Silhouette

A Fast Sub-Volume Search Method for Human Action Detection

Mining Spatial Temporal Saliency Structure For Action Recognition

Human Action Recognition under Log-Euclidean Riemannian Metric.

Robust Detection and Localization of Human Action in Video.

Efficient Search and Localization of Human Actions in Video Databases

Attention-driven Action Retrieval with DTW-based 3d Descriptor Matching.