Abstract: The performance of action recognition in video sequences depends significantly on how actions are represented and on how the similarity between representations is measured. In this paper, we combine two kinds of features extracted from spatio-temporal interest points with context-aware kernels for action recognition. For action representation, local cuboid features extracted around interest points are widely used within a Bag of Visual Words (BOVW) model. Such representations, however, ignore potentially valuable information about the global spatio-temporal distribution of interest points. We propose a new global feature that captures the detailed geometrical distribution of interest points. It is computed with the 3D ℛ transform, which is defined as an extended 3D discrete Radon transform, followed by a two-directional two-dimensional principal component analysis. For similarity measurement, we model a video set as an optimized probabilistic hypergraph and propose a context-aware kernel to measure high-order relationships among videos. The context-aware kernel is more robust to noise and outliers in the data than the traditional context-free kernel, which considers only pairwise relationships between videos. The hyperedges of the hypergraph are constructed based on a learnt Mahalanobis distance metric. Any disturbing information from other classes is excluded from each hyperedge. Finally, a multiple kernel learning algorithm is designed by integrating ℓ₂-norm regularization into a linear SVM classifier to fuse the ℛ feature and the BOVW representation for action recognition. Experimental results on several datasets demonstrate the effectiveness of the proposed approach for action recognition.
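
To make the idea of a global feature over the interest-point distribution more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the 3D ℛ transform follows the usual 2D ℛ-transform convention (the squared Radon transform integrated over the offset ρ), and it approximates the discrete 3D Radon transform of a point set by counting the points that fall into parallel slabs orthogonal to each direction. The function name `r_transform_3d`, the angular sampling, and the bin count are illustrative assumptions; the subsequent two-directional 2D PCA step is not shown.

```python
import numpy as np

def r_transform_3d(points, n_theta=8, n_phi=16, n_bins=32):
    """Sketch of a 3D R-transform-like feature for an (N, 3) array of
    interest points given as (x, y, t). Returns an (n_theta, n_phi) array."""
    points = np.asarray(points, dtype=float)
    # Centering removes the dependence on where the action occurs.
    centered = points - points.mean(axis=0)
    radius = np.linalg.norm(centered, axis=1).max()
    # Small padding so every projection falls inside the histogram range.
    radius = radius * (1.0 + 1e-9) if radius > 0 else 1.0

    thetas = np.arange(n_theta) * np.pi / n_theta  # polar angle of the plane normal
    phis = np.arange(n_phi) * np.pi / n_phi        # azimuth of the plane normal
    feature = np.zeros((n_theta, n_phi))
    for i, theta in enumerate(thetas):
        for j, phi in enumerate(phis):
            normal = np.array([np.sin(theta) * np.cos(phi),
                               np.sin(theta) * np.sin(phi),
                               np.cos(theta)])
            # Signed offset of each point along the normal: points with the
            # same offset lie on the same plane.
            rho = centered @ normal
            # Discrete Radon transform for this direction: points per slab.
            counts, _ = np.histogram(rho, bins=n_bins, range=(-radius, radius))
            # Assumed R-transform convention: sum of squared Radon values over rho.
            feature[i, j] = np.sum(counts.astype(float) ** 2)
    return feature

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_points = rng.normal(size=(200, 3))  # synthetic stand-in for interest points
    print(r_transform_3d(demo_points).shape)
```

Large values in the resulting array indicate directions along which the interest points concentrate on a few planes, which is the kind of global geometric information that a BOVW histogram discards.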

Hyper-Fisher Vectors for Action Recognition

Action Recognition with Stacked Fisher Vectors.

A Joint Evaluation of Dictionary Learning and Feature Encoding for Action Recognition.

Good Practices for Learning to Recognize Actions Using FV and VLAD

Action Recognition Using Hybrid Feature Descriptor And Vlad Video Encoding

DA-VLAD: Discriminative Action Vector of Locally Aggregated Descriptors for Action Recognition

Hybrid super vector with improved dense trajectories for action recognition

Towards Good Practices for Action Video Encoding

Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition

Agglomerative Clustering and Residual-VLAD Encoding for Human Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Fusing ℛ Features and Local Features with Context-Aware Kernels for Action Recognition

Fisher Vector Based CNN Architecture for Image Classification.

A Compact Representation of Human Actions by Sliding Coordinate Coding

Contextual Fisher Kernels for Human Action Recognition

Action Recognition with Uncertain VLAD

Human action recognition in videos using hybrid motion features

Action Recognition by Multiple Features and Hyper-Sphere Multi-class SVM

F2D-SIFPNet: a Frequency 2D Slow-I-Fast-P Network for Faster Compressed Video Action Recognition

Joint Feature Optimization and Fusion for Compressed Action Recognition