Abstract: In this paper, we present a unified representation based on the spatio-temporal steerable pyramid (STSP) for the holistic representation of human actions. A video sequence is viewed as a spatio-temporal volume that preserves all the appearance and motion information of the action it contains. By decomposing spatio-temporal volumes into band-passed sub-volumes, the spatio-temporal Laplacian pyramid provides an effective technique for multi-scale analysis of video sequences, so that spatio-temporal patterns at different scales can be well localized and captured. To efficiently explore the underlying local spatio-temporal orientation structures at multiple scales, a bank of three-dimensional separable steerable filters is applied to each sub-volume of the Laplacian pyramid. The outputs of each quadrature pair of steerable filters are squared and summed to yield a more robust oriented energy representation. To make the representation more invariant and compact, a spatio-temporal max pooling operation is performed between filter responses at adjacent scales and over spatio-temporal neighbourhoods. To capture the appearance, local geometric structure and motion of an action, we apply the STSP to the intensity, 3D gradients and optical flow of video sequences, yielding a unified holistic representation of human actions. Taking advantage of multi-scale, multi-orientation analysis and feature pooling, STSP produces a compact yet informative and invariant representation of human actions. We conduct extensive experiments on the KTH, UCF Sports and HMDB51 datasets, which show that the unified STSP achieves results comparable to state-of-the-art methods.
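
The pipeline described in the abstract (Laplacian pyramid decomposition, quadrature-pair oriented energy, and max pooling across adjacent scales and local neighbourhoods) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several simplifications are assumptions made here for brevity: the pyramid uses a binomial blur with nearest-neighbour upsampling, and an axis-aligned Gabor-like even/odd pair stands in for the separable steerable filter bank, so only the x, y and t axes are probed rather than arbitrary steered orientations. All function names and parameters (`filter_axis`, `laplacian_pyramid`, `quadrature_energy`, `max_pool3d`, `stsp_sketch`, `sigma`, `freq`) are hypothetical.

```python
import numpy as np

def filter_axis(vol, kernel, axis):
    """Correlate a 3D volume with a 1D kernel along one axis (edge-padded)."""
    half = len(kernel) // 2
    pad = [(0, 0)] * vol.ndim
    pad[axis] = (half, half)
    padded = np.pad(vol, pad, mode="edge")
    out = np.zeros(vol.shape, dtype=float)
    n = vol.shape[axis]
    for k, w in enumerate(kernel):
        idx = [slice(None)] * vol.ndim
        idx[axis] = slice(k, k + n)
        out += w * padded[tuple(idx)]
    return out

def laplacian_pyramid(vol, levels):
    """Return `levels` band-pass sub-volumes followed by the low-pass residual."""
    binomial = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    bands, current = [], vol.astype(float)
    for _ in range(levels):
        low = current
        for axis in range(3):
            low = filter_axis(low, binomial, axis)
        down = low[::2, ::2, ::2]
        # Nearest-neighbour upsampling (an assumption; keeps the sketch short).
        up = down.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)
        up = up[:current.shape[0], :current.shape[1], :current.shape[2]]
        bands.append(current - up)
        current = down
    bands.append(current)
    return bands

def quadrature_energy(vol, axis, sigma=1.5, freq=0.25):
    """Square and sum the responses of an even/odd (quadrature-like) filter pair."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    envelope = np.exp(-x**2 / (2.0 * sigma**2))
    even = envelope * np.cos(2.0 * np.pi * freq * x)
    even -= even.mean()  # remove the DC response of the even filter
    odd = envelope * np.sin(2.0 * np.pi * freq * x)
    return filter_axis(vol, even, axis) ** 2 + filter_axis(vol, odd, axis) ** 2

def max_pool3d(vol, size=2):
    """Non-overlapping max pooling over size x size x size neighbourhoods."""
    t, h, w = (s // size * size for s in vol.shape)
    v = vol[:t, :h, :w]
    v = v.reshape(t // size, size, h // size, size, w // size, size)
    return v.max(axis=(1, 3, 5))

def stsp_sketch(vol, levels=2):
    """Pool oriented energies between adjacent scales, then over neighbourhoods."""
    bands = laplacian_pyramid(vol, levels)[:-1]  # drop the low-pass residual
    features = []
    for axis in range(3):  # axis-aligned stand-in for steered orientations
        energies = [quadrature_energy(b, axis) for b in bands]
        for fine, coarse in zip(energies, energies[1:]):
            fine_ds = max_pool3d(fine, 2)  # bring the finer scale to the coarser grid
            shape = tuple(min(a, b) for a, b in zip(fine_ds.shape, coarse.shape))
            crop = tuple(slice(0, s) for s in shape)
            merged = np.maximum(fine_ds[crop], coarse[crop])
            features.append(max_pool3d(merged, 2).ravel())
    return np.concatenate(features)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((16, 16, 16))  # synthetic (t, y, x) volume
    print(stsp_sketch(video).shape)
```

In the method as described, the same decomposition would be applied separately to the intensity, 3D gradient and optical flow volumes and the results combined; the sketch above handles a single generic volume only.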
